Editorial Pointers


ONLINE SEARCH IS PART OF DAILY LIFE, WITH popular search engines and digital libraries typically
supporting users best able to define their information needs. The hitch comes when those needs are
sketchy or the knowledge necessary is absent. How do users find the information they need if they don’t
know it exists? How do they formulate an effective query? How do they steer their way through a
complex site to find what they are looking for?

The answers may lie in a new wave of R&D efforts to improve 
search interfaces and tools, design new ones, and create more effective 
exploration strategies. This wave, billed as exploratory search, is gather- 
ing momentum among a growing number of communities, including 
information retrieval, user interface design, library science, information 
visualization, and more. As guest editors Ryen White, Bill Kules, 
Steven Drucker, and m.c. schraefel contend, search tools must evolve as 
user requirements evolve from using search for lookup to using it to 
learn, investigate, and explore. 


ALSO IN THIS ISSUE, LI ET AL. DETAIL THE WRITEPRINT METHOD for tracking online criminal activity
by analyzing writing style to identify the authors of online messages linked to cybercrime
investigations. Nenad Jukic examines the various approaches to modeling a data warehouse. And
Andrea Ordanini discusses the three key elements (content, governance, and structure) for building
a successful B2B marketplace.

Does color play a role in how we respond to an email message? Yes, 
according to Zviran et al., who claim different colors prompt different 
reactions. Vidal and Mulet wonder what the next generation of CAD 
systems will offer. And Enns et al. debunk the IT professional stereo- 
type, urging HR professionals to do the same. 

In “Viewpoint,” Argamon and Olsen contend online access to the 
world’s knowledge is only half the story, urging computer scientists to 
join the effort to add meaning to the emerging global digital library. In 
“The Profession of IT,” Peter Denning describes how to assemble hasti- 
ly formed networks in response to disaster relief and emergencies. In 
“Digital Village,” Hal Berghel tells some compelling phish tales that 
help hook online schemers. And David Patterson (“President’s Letter”) 
and Lauren Weinstein (“Inside Risks”) share some razor-sharp observations,
taking April Fool’s Day to heart.
News Track


THE BIG SWEEP
The U.S. government is developing a massive com- 
puter system that could someday collect huge 
amounts of data and, by linking information from 
blogs and email messages to government records 
and intelligence reports, search for patterns of ter- 
rorist activity. The Christian Science Monitor reports 
this little-known system—called Analysis, Dissemi- 
nation, Visualization, Insight, and Semantic 
Enhancement (ADVISE)—is an R&D program 
within the Department of Homeland Security. The 
scope of the project, mentioned in only a few public 
documents, is what sets it apart. The system would 
collect a vast array of corporate and public online 
information and cross-reference it against U.S. law- 
enforcement records. It would then store findings as 
“entities,” that is, linked data about people, places, 
and events, among other categories. The system 
would be big enough to retain information on 
approximately one quadrillion entities. If each entity 
were a penny, they would collectively form a cube 
approximately double the height of the 
Empire State Building. Said one project 
spokesperson, the key is not merely to iden- 
tify terrorists or sift through key words but 
to identify critical patterns in data that 
illuminate motives and intentions. 
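
The penny comparison can be checked with rough arithmetic. The sketch below is a minimal, hedged estimate and not part of the original report: the penny dimensions (19.05 mm diameter, 1.52 mm thick) and the building's roof height (381 m) are assumptions introduced here, and the two packing models are simplifications.

```python
import math

# Assumed figures, not taken from the news item:
PENNY_DIAMETER_M = 0.01905    # U.S. cent diameter, 19.05 mm
PENNY_THICKNESS_M = 0.00152   # U.S. cent thickness, 1.52 mm
EMPIRE_STATE_ROOF_M = 381.0   # roof height, excluding the antenna
ENTITY_COUNT = 10**15         # one quadrillion entities, as quoted


def cube_side_m(count, volume_each_m3):
    """Side length of a cube whose volume equals count * volume_each_m3."""
    return (count * volume_each_m3) ** (1.0 / 3.0)


# Lower bound: pennies melted into solid metal (cylinder volume only).
solid_volume = math.pi * (PENNY_DIAMETER_M / 2) ** 2 * PENNY_THICKNESS_M
# Looser packing: each penny occupies its square bounding box in a stack.
boxed_volume = PENNY_DIAMETER_M ** 2 * PENNY_THICKNESS_M

solid_ratio = cube_side_m(ENTITY_COUNT, solid_volume) / EMPIRE_STATE_ROOF_M
boxed_ratio = cube_side_m(ENTITY_COUNT, boxed_volume) / EMPIRE_STATE_ROOF_M

print(f"solid metal: {solid_ratio:.2f} x roof height")
print(f"grid packed: {boxed_ratio:.2f} x roof height")
```

Under these assumptions both ratios land close to 2, which is consistent with the "approximately double" figure quoted above.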


BMW BLASTED
Google has reportedly blacklisted BMW 
for breaching its guidelines, claiming the 
German car manufacturer's Web site influenced 
search results to ensure top ranking when users 
searched for “used cars.” BBC News Online reports 
Google countered by reducing BMW’s page rank to 
zero, thus guaranteeing the company no longer finds a top spot. BMW has admitted using “doorway pages”
to boost search rankings but denied it was trying to mislead users. The BMW Web site, which is heavily
reliant on JavaScript code unsearchable by Google, used text-heavy pages sprinkled with key words to
attract Google’s indexing system. However, once users clicked on a link in Google’s results window,
they were redirected to a regular BMW Germany page containing fewer key words. BMW contends the
content shown in the search results did not differ from that of the final Web site. Google points out
its guidelines involve a series of requirements, the first being to design Web sites for users, not
for search engines; any attempt to manipulate searches will not be tolerated.


AMERICANS TURN TO THE NET
21 million use it to get additional career training.
17 million use it for dealing with major illnesses.
17 million use it for choosing a school for a child.
16 million use it to buy a car.
16 million use it for a major financial decision.
10 million use it for finding a new place to live.
8 million use it when changing jobs.
7 million use it to cope with family illness.
Source: Pew Internet and American Life Project


HIGH-PERFORMANCE HELP 
The U.S. Department of Energy’s Office of Science awarded 18.2 million hours of computing time on
some of the world’s most powerful supercomputers to help researchers in government labs,
universities, and industry working on projects ranging from ecological processes affecting climate
change to a better understanding of Parkinson's disease. The allocation of computer time is made
under DOE's Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program,
providing resources to computationally intensive research projects in the national interest. For the
first time in INCITE’s three-year history, proposals from private-sector researchers were encouraged.
In return, much of the resulting knowledge will be made public. Among the companies receiving
awards are Boeing, DreamWorks Animation, General Atomics Co., Tech-X Corp., and Pratt &
Whitney. Academic and research institutions to receive computing time include Caltech, Harvard,
Howard Hughes Medical Center, the University of Strathclyde, and the University of California,
Berkeley.


ACM ISSUES OFFSHORING REPORT
At press time, ACM released the results of a comprehensive international study entitled “Globalization and
Offshoring of Software: A Report of the ACM Job Migration Task Force.” To view the complete report,
please visit www.acm.org/globalizationreport.


THE BOSS IS WATCHING
An increasing number of British businesses are 
using mobile-phone tracking technology to moni- 
tor their employees, company cars, and equipment. 
Reuters reports that business is booming for firms offering tracking services in the U.K.,
where corporate surveillance in the name of “operational efficiency” is escalating. The benefits
are many, say firms employing such services, ranging from discovering workers have been delayed at
the pub, not by a traffic jam, to being able to quickly locate staff
members or reroute them if necessary. Richard
Wildings, a professor of supply chain manage- 
ment, says a company that knows where its 
employees, products, and equipment are could 
derive significant cost benefits if it uses the infor- 
mation effectively. U.K. government regulations 
stipulate that employees must be informed they are 
being monitored; companies cannot engage in 
such action covertly. Civil rights groups, however, 
worry that some workers will be coerced into 
being monitored. 


GOVERNMENT’S ROLE IN ASIA’S SOFTWARE
The vast majority of IT professionals in Asia contend
government must help foster the area’s software
industry. ZDNetAsia reports that 90% of the 800 IT
(private-sector) professionals surveyed in China, 
Indonesia, Malaysia, Singapore, South Korea, Thai- 
land, Vietnam, and the Philippines agreed govern- 
ment is a key aspect of the success of the region’s 
software industry. Although an overwhelming major- 
ity wanted government to play an active role in 
developing the industry, 48% felt that regulations 
should be left to industry. Indeed, only 25% of the
respondents felt government should be in charge of
making policy.


FAST(ER) FOOD TO GO
The fast food industry is moving with all deliberate speed to devise ways to deliver
even faster service to customers. Since drive-thru service represents a huge portion of
corporate sales for many chains, the focus is primarily on the use of technology to
assemble orders, collect payment, and deliver food to drivers. The Associated Press
reports that tech firms have been tapped to create digital menus that increase sales by
suggesting side dishes or desserts to add to customers’ orders. Some McDonald's facilities
are using central call centers rather than restaurant
cashiers to take drive-thru orders. Smaller
chains have started testing confirmation screens, 
which display orders back to customers so they can 
make corrections before pulling up to the window. 
Hyperactive Bob, a computer system from Pitts- 
burgh-based Hyperactive Technologies, tells man- 
agers how much food to prepare by counting waiting 
vehicles and factoring in demand for current promo- 
tions and popular staple items. And Burger King is 
working to break the speed barrier with technology 
that helps cooks ensure that precooked food stays 
fresh by keeping track of how long it’s been since it 
was prepared. 


Send items of interest to cacm@acm.org 
To Produce a Good Plan, Start with a Good Process


WHILE PHILLIP G. Armour’s “The Business
of Software” column
(“Counting Boulders and Measur- 
ing Mountains,” Jan. 2006) was 
based on good ideas, many of 
them were overshadowed by errors 
and omissions that made it diffi- 
cult to understand what he was 
getting at. A systems perspective is 
necessary, but even systems engi- 
neers can miss the wider view, 
being wrapped up in possibly 
faulty processes or lacking the 
experience to see beyond their 
own domain. 

Failing to see the big picture, 
software engineers can end up 
making suboptimal if not outright 
bad decisions. Who hasn't seen a 
programmer make an unautho- 
rized “improvement” only to cause 
other working, tested, accepted 
modules to begin to fail or delay 
testing and increase costs? 

Armour didn’t make a clear dis- 
tinction between planning and 
execution. For example, were the 
project’s estimates for budget plan- 
ning or for execution? Were they 
for initial sizing/planning or for an 
actual bid to a customer? Risk also 
must be considered, along with 
strategic objectives. Were they
purely profit or perhaps to gain
experience from bidding on or 
doing the project? Did the devel- 
opers have to achieve a given 
result and know the resources— 
time, money, people—they 
needed to achieve it? Would the 
project have been further compli- 
cated if it involved an absolute 
deadline? Or was it a need-it-yes- 
terday contract following a 9/11- 
type emergency? The latter would 
be closer to a research effort than 
to an engineering project. If that 
were the case, time would be 
imperative and costs essentially 
irrelevant. Yet other projects might 
minimize resources by trading off 
schedule delay. And each primary
consideration would require a different
planning and execution approach.

In order to execute an engineering
project, the fine resolution and
detailed planning Armour objected to
are actually vital to success. One
should not think that because they
are not necessary in certain projects
they can be unilaterally expunged from
all projects, as the agile advocates
would have us believe.

No rational executive would
give the green light to a project
based on an unsubstantiated
guesstimate of total costs. Risk 
may not be eliminated through 
detailed planning but can be min- 
imized when the plan is based on 
more than wishful thinking. 
Without a good process that 
selects the right approach, mis- 
takes are likely, and tools, not 
users, are likely to be blamed for 
the result. 

Armour correctly noted that 
many requirements are not prop- 
erly defined before a project gets 
under way. The only time that can 
happen without detailed require- 
ments is when it’s for research or 
possibly as a feasibility project to 
aid in planning. If it is, it should 
not be confused with the effort 
needed to create a system or product.
Moreover, the software engineer
is not responsible for
determining system requirements; 
the systems engineer is supposed 
to give them to the software engi- 
neer. 

Many practitioners do not 
understand functional require- 
ments or the effects of nonfunc- 
tional components on total 
requirements; they may be 
unstated yet still return to bite 
everyone later on. Practitioners 


COMMUNICATIONS OF THE ACM April 2006/Vol. 49, No. 4 11 





ROBERT NEUBECKER 


The Profession of IT | Peter J. Denning 


Hastily Formed Networks 


The ability to form multi-organizational networks rapidly is crucial to 
humanitarian aid, disaster relief, and large urgent projects. Designing and 
implementing the network’s conversation space is the central challenge. 


On Sept. 11, 2001, terrorists
attacked the World Trade
Center, taking 2,749 lives.
The attack resulted in severe eco- 
nomic impact, especially to air- 
lines, and a stock market loss of 
$1.2 trillion. On Dec. 26, 2004, 
a tsunami from a 9.1 earthquake 
overran the shores of many coun- 
tries along the vast rim of the 
Indian Ocean. Over 283,000 peo- 
ple died. On Aug. 29, 2005, Kat- 
rina, a category-5 hurricane, 
knocked out electric and commu- 
nication infrastructure over 
90,000 square miles of Louisiana 
and Mississippi and displaced 1.5 
million people. Six months later, 
New Orleans still housed fewer 
than 100,000 of its original 1.2 
million residents. On Oct. 8, 
2005, a magnitude-7.6 earthquake 
devastated the Kashmir region of 
Pakistan, killing over 87,000 peo- 
ple. Besides being unexpected 
major disasters, these events had 
one other common feature: they 
all involved hastily formed net- 
works that quickly mobilized, 
organized, and coordinated mas- 
sive humanitarian responses. 


The severity of these disasters 
drove home an important point: 
the quality of the response 
depended not on response plan- 
ning or on new equipment, but 
on the quality of the network 
that came together to provide 
relief. How quickly were voice 
and data communications 
restored? How well did the many 
players from disparate organiza- 
tions collaborate? How effectively 
did the network deliver help to
the victims? These incidents
demonstrated sharp differences in
the quality of the hastily formed
network (HFN), which directly
affected the effectiveness of the 
response. Noting that these net- 
works almost always involve mili- 
tary, civilian government, and 
non-government organizations, 
the U.S. Departments of Defense 
and Homeland Security have 
made it a priority to learn how to
effectively assemble HFNs. We 
coined the term at the Naval 
Postgraduate School in 2004. 

The lessons learned from the 
networks involving government 
carry directly into private settings. 
They will benefit any urgent net- 
work of multiple organizations 
with no common authority that 
must cooperate and collaborate. 

Hastily formed networks is an 
area where advanced networking 
technology and human organiza- 
tion issues meet. They can work 
well together, or they can clash. 
Our purpose here is to give an 
overview of this critical area and 
the challenges it offers to com- 
puting professionals. 


ORIGINS 
The idea of quickly forming a 
team for a particular, urgent task, 
and then disbanding it when 
done, is not new. Table 1 lists
three categories of events to
which an HFN must respond.
Because it involves relatively small 
teams and known networks, the 
first category is the easiest and 
least likely to stress the HFN. 
The middle category is the 
type that emergency agencies 
such as police and fire depart- 
ments prepare for. They have pro- 
fessional, highly trained teams 
ready to respond to particular 
incidents. They have well-devel- 
oped practices for advance plan- 
ning, training in appropriate 
skills, and positioning of equip- 
ment. They already use terms like 
“ad hoc network” and “crisis 
response network” to describe 
what they do. 


Table 1. Kinds of events requiring response from hastily formed networks.

Category                Characteristics                        Examples
First category          Know what to do                        Fast response team for time-critical
                        Use existing network structures        business problem or opportunity
                        May choose not to respond
Third category          Don't know what to do                  9/11 attack, other terrorist attacks, large
(Unknown/Unknown, UU)   Don't know time or place               earthquake, major natural disasters
                        Responding network structure unknown   (Note: KU events can become UU events when
                                                               scaled up to large areas or populations)


The third category puts the greatest stress on the HFN. These events require response beyond
the control and capabilities of any single agency. The network structure will depend on the
event and the responding organizations.

The main aspects of the third-category challenge are:


* Genuine surprise. The precipitating event is in no known category. There has been no advance
planning, training, or positioning of equipment.

* Chaos. Everyone is overwhelmed. No one understands the situation or knows what to do.
People are frantic and panicky.

* Totally insufficient resources. Available resources and training are overwhelmed by the
magnitude of the event.

* Multi-agency response. Several agencies must cooperate in the response, including military,
civilian government, and private organizations. These groups have had little or no prior reason
to collaborate. The shock of moving from a state of “coexistence” to a state of “collaboration”
can be overwhelming.

* Distributed response. The response is distributed over a geographical area into many local
jurisdictions. The authority to allocate resources and reach decisions is distributed among many
organizations. Decisions by command-and-control do not work.

* Lack of infrastructure. Critical infrastructures such as communications, electricity, and water
do not work. Makeshift infrastructures must be deployed quickly.


HFN DEFINED 

The first priority after the precip- 
itating event is for the responders 
to communicate. They want to 
pool their knowledge and inter- 
pretations of the situation, under- 
stand what resources are available, 
assess options, plan responses, 
decide, commit, act, and coordi- 
nate. Without communication, 
none of these things happens: the 
responders cannot respond. Thus 
the heart of the network is the 
communication system they use 
and the ways they interact within 
it. We call this the “conversation 
space” of the HFN. 

An HFN has five elements: it is
(1) a network of people established
rapidly (2) from different communities,
(3) working together in a
shared conversation space (4) in
which they plan, commit to, and
execute actions, to (5) fulfill a
large, urgent mission.

An HFN is thus much more
than a set of organizations using 
advanced networking technology. 
To be effective in action, HFN 
participants must be skilled at: 


* Setting up mobile communication and sensor systems;

* Conducting interagency operations, sometimes called “civil-military boundary”;

* Collaborating on action plans and coordinating their execution;

* Improvising; and

* Leading a social network, where communication and decision making are decentralized, and
there is no hierarchical chain of command or ex officio leader.


Most participants do not have 
a need for these skills in their 
individual organizations. When 
they come together, therefore, 
they find it difficult to accomplish these tasks. When combined
with the overwhelming nature of the urgent event, these inherent
difficulties can lead to a breakdown in the conversation space.


CONVERSATION SPACE

The ongoing need to communicate and coordinate is fundamental for the success of any HFN.
The term conversation space was introduced for the medium in which all this takes place, from
forming community responses to delivering actions. The conversation space is (1) a medium of
communication among (2) a set of players (3) who have agreed on a set of interaction rules. These
three aspects are summarized in Table 2.

Table 2. Components of conversation space.

Component               Description                              Examples
Physical systems        Media and mechanisms by which people     Telephone, power, roads, meeting places,
                        communicate, share information, and      supplies, distribution systems
                        allocate resources
Interaction practices   Rules of the “game” followed by the      Situational awareness, sharing information,
                        players to organize their cooperation    planning, reaching decisions, coordination,
                        and achieve their outcomes               unified command and control, authority,
                                                                 public relations. (Note: environment has no
                                                                 common authorities, no hierarchy, many
                                                                 autonomous agents, decentralized
                                                                 communications)

One of our early conclusions was that the effectiveness of the HFN rests on the quality of the
conversation space established at the outset. It is not a foregone conclusion that an effective HFN
can be established even when the players are trained professionals, as the situations in New York
City after 9/11 and in New Orleans after Hurricane Katrina illustrate. In New York, the
mayor understood intuitively that success would depend on everyone, including especially the
residents of the city, feeling included
in the relief effort. He made sure
information was shared, even if 
piecemeal. While there were some 
initial coordination difficulties, 
the network came together and 
was effective in relief and recovery.
A different picture emerged
in New Orleans. The various
agencies had major difficulties in
coordinating, and the Federal
Emergency Management Agency
(FEMA) did not deliver what
people thought it had promised.
At all levels there was a lot of
finger-pointing and wrangling over
who would do what and who
would pay for what. When the
president put a new man in
charge at FEMA, there was no
immediate improvement in effectiveness
or reduction in criticism of the agency.
Attempts to impose standard mil- 
itary-style command-and-control 
in Louisiana and Mississippi were 
ineffective. This is not intended 
as a criticism of New York, 
Louisiana, or Mississippi officials, 
but rather an illustration that 
effective coordination may not 
happen even when all the parties 
want it to happen. 

Certainly a major difference 
between New York and New 
Orleans was the sheer scale of the 
event. New York lost infrastruc- 
ture in a limited area of perhaps 
100 square blocks. The primary 
agencies in the network ulti- 
mately reported to the mayor. 
Police and fire radios provided 
basic communications in the 
“ground zero” area. In contrast, 
New Orleans lost an entire city
and was part of a large area 
(90,000 square miles) with 
severely damaged infrastructure. 
All communication systems were 
knocked out; and as they were 
gradually being restored, the lim- 
ited-bandwidth channels were 
overwhelmed by sheer numbers 
of citizens trying to use them. 
Many more agencies had to 
cooperate on the response. Cop- 
ing with all this effectively was 
completely outside most respon- 
ders’ experience. 

New York City quickly built 
trust among the responders and 
citizens. New Orleans experi- 
enced considerable difficulty in 
building trust. 

But this is one of the lessons: 
the more overwhelming the 
event, the more likely turf-assert- 
ing tendencies will occur and 
interfere with the effectiveness of 
the network. 

The overarching lesson is: the 
effectiveness of an HFN depends 
as much on the participating peo- 
ple and organizations as it does 
on the communication system 
through which they interact. 


CONDITIONED TENDENCIES 

It is well known that individuals 
under severe stress often forget 
their recent training and regress
to old, ingrained habits [1, 7].
Richard Strozzi Heckler calls 
these old habits conditioned ten- 
dencies [6]. The old habit is 
likely to be inappropriate for the 
current situation and to make 
matters worse. 

The National Institute of Stan- 
dards and Technology ([4], p. 
174) concluded that “a prepon- 
derance of evidence indicates that 
emergency responder lives were 
likely lost at the World Trade 
Center resulting from the lack of 
timely information-sharing...” 
Police radio transcripts cited by 
NIST indicate that NYPD heli- 
copters monitoring the two 
burning towers detected signs of 
structural collapse in the North 
Tower and issued an emergency 
evacuation order to all police. Yet 
no one in the police department 
communicated the imminent- 
collapse information to the fire 
department. What accounts for 
this bizarre behavior? 

Joseph Pfeifer, a deputy assis- 
tant chief in the New York City 
Fire Department, gives in his 
master’s thesis a detailed example 
of conditioned tendencies 
instilled by emergency-response 
organizations, which paradoxi- 
cally can render them incapable 
of effective response in an emergency [5].
Pfeifer was among
those responding to the 9/11 disaster
in the World Trade Center.
His explanation for non-commu- 
nicative behavior was that organi- 
zational biases—ingrained social 
habits of the separate organiza- 
tions—prevented emergency per- 
sonnel from talking to one 
another. One of these biases is 
organizational social identity that 
prefers to share information 
within the group but not outside. 
Under stress, the group members 
do not think to collaborate or 
share information outside the 
group, or to take personal respon- 
sibility for the welfare of mem- 
bers of other groups. 

The purpose of Pfeifer’s study 
was not to assign blame for need- 
less loss of life in the 9/11 
attacks, but to recognize the orga- 
nizational conditioned tendency 
as a real phenomenon that can 
disable an HFN. The question is 
how to prepare organizations to 
work together in an HFN and 
avoid the conditioned tendency. 
Pfeifer proposed that the agencies 
use unified command networks, 
in which leadership is shared 
among different organizations; 
for example, an executive committee.
This practice will likely
create the foundation for HFNs
that do not suffer leadership
paralysis from non-communication.


A GUIDE TO EFFECTIVE HFNs

(1) The quality of the conversa- 
tion space is critical to success. 
The space includes the communi- 
cation systems, the participants, 
and their interactions within 
these systems. Effectiveness in 
conversation space rests on skills 
that participants may not ordi- 
narily learn in their separate orga- 
nizations. 

(2) The physical communica- 
tion systems are part of conversa- 
tion space. Plan and test mobile 
technologies that can be set up 
quickly when the regular infra- 
structure is down. Arrange for 
security forces to protect the tem- 
porary infrastructure. Use and 
test all communications equip- 
ment regularly. Use standard 
software and protocols—interop- 
erability and simplicity of inter- 
connection will be important. 
Web services are a good example. 

(3) The participating organiza- 
tions are another part of conver- 
sation space. Each brings its own 
culture, standard practices, and 
decision-making protocols— 
which may be incompatible with 
other organizations. Individuals 
can become disoriented when 
familiar organizational practices 
are suspended. They fail to take 
initiative, while waiting for orders 
that will never come. They do 
not know how to function when 
there is no common authority, 
their established command-and- 
control practices do not work, 
and collaboration, not control, is 
the only way to get actions done. 

(4) Information glut will be a 
problem in the network. As
communications are initially restored,
the victims will overload the 
severely limited bandwidth as they 
try to communicate with their 
families. The responders them- 
selves will overwhelm their col- 
leagues with situational reports 
and other data. New technologies 
will be needed to manage infor- 
mation glut and keep the network 
functioning. 

(5) Understand and practice 
the effective technologies for col- 
laborative networks. These 
include Web servers to distribute 
information, wiki and discussion- 
thread software, chat and instant- 
messaging services, virtual 
markets, and coordination services 
such as Groove (but Groove is 
restricted to Windows platforms). 

(6) Prepare to overcome the 
barriers to interorganizational col- 
laboration. These include conflict- 
ing missions, unclear roles, turf 
protection, incompatible processes 
and information systems, disparate 
cultures, accountability, mistrust, 
and lack of knowledge of others’ 
capabilities [3]. 

(7) Prepare for organizational 
conditioned tendencies to appear 
under overwhelming stress. Train 
group members in the basic HFN 
skills. Promote political support 
for the organizations to cooper- 
ate, mutual respect for the com- 
petencies that each organization 
brings, concern for each other’s 
welfare, and personal responsibil- 
ity for actions and outcomes. 
Practice with “unified command”:
an executive committee
representing the participating
organizations that respects the
core competencies that each
organization brings.

(8) Train the skill of improvisa- 
tion. This is a challenge for nor- 
mal rule-oriented agencies. 


CHALLENGES FOR COMPUTING 
PROFESSIONALS 

HFNs bring new words and
concepts such as conversation space,
coordination without hierarchy, 
and conditioned tendency. Learn 
these concepts; they are important. 

Interoperability and simplicity 
are key technical challenges for 
HFNs. Services offered via Web
interfaces are highly interopera- 
ble; anyone can use them from 
any computer. Chat and text 
messaging services are highly 
interoperable. But many key ser- 
vices are not. For example, many 
responders have found the 
Groove software to be useful for 
coordination, but Groove runs 
only on Windows computers; 
responders with Sun worksta- 
tions, Apple Macintoshes, or 
Linux-based computers are out of 
luck. Many wireless networks are
not fully interoperable; for exam-
ple, Linux and Apple machines
use different protocols from Win-
dows machines for encryption
and passwords.

To prevent information glut, 
we need tools to model, label, 
and filter information so that net- 
work participants receive the 
information most likely to be 
valuable to them. In crises espe- 
cially, the participants need to 
make most effective use of the 
limited resources of decision- 
making time and communica-
tions bandwidth by restricting
the flow of unimportant bits. 
Hayes-Roth shows how a 
100,000-fold drop in informa- 
tion volume is easily achievable 
without loss of effectiveness [2]. 
Learn about the organizational 
issues, such as collaborative coop- 
eration and managing condi- 
tioned tendencies. You may be 
part of an organization that 
responds in an HFN, and you
will need to know this. 
Understanding how to create 
HFNs is one of the most challeng-
ing parts of modern networking. It
is about how a network, its people 
and its equipment, may function 
efficiently under extreme stress.
Phishing Mongers and Posers


Unmasking deceptive schemes that range from clever to clumsy. 


The following definition from the Antiphishing Web site (www.antiphishing.org) is a useful place to begin this column:

“What is phishing and pharm-
ing? Phishing attacks use both
social engineering and technical 
subterfuge to steal consumers’ 
personal identity data and finan- 
cial account credentials. Social- 
engineering schemes use 
‘spoofed’ email to lead con- 
sumers to counterfeit Web sites 
designed to trick recipients into 
divulging financial data such 
as credit card numbers, 
account usernames, passwords 
and Social Security numbers. 
Hijacking brand names of 
banks, e-retailers, and 
credit card companies, 
phishers often convince 
recipients to respond. 
Technical subterfuge 
schemes plant crimeware onto 
PCs to steal credentials directly, 
often using Trojan keylogger spyware. Pharming crimeware misdirects users to fraudulent sites or proxy servers, typically through DNS hijacking or poisoning.”
The phish mongers to which I refer in the title of this column are those who deploy these phish scams in such a way that they stand a measurable chance of success against a reasonably intelligent and enlightened end user. The posers are the bottom feeders in the phishing community who exhibit a very low level of sophistication. This distinction is critical if one attempts to thwart phishing.


PHISHING FACTS 
There’s more to phishing than 
throwing digital bait on the Net. 
All too often descriptions of 
phishing scams drill down into 
deceptive URLs, fake address 
bars, and the like but fail to 
investigate the set-up that pre- 
cedes the sting. 

Effective phishing requires that the bait:

* Look real;
* Present itself to an appropriate target-of-opportunity;
* Satisfy the reasonableness condition (going after the bait is not an unreasonable thing to do);
* Cause the unwary to suspend any disbelief; and
* Clean up after the catch.


The similarities with angling should not be overlooked. There are reasons why anglers neither troll with charcoal briquettes nor fly-fish for sharks. I will illustrate the analogy to the digital surf with a few examples taken from one of the phishing research projects in our lab.

Figure 1a. Phishing email that satisfies the effectiveness criteria.

Figure 1a is modeled after some live phish we captured on the Net. Let’s analyze this in terms of the five criteria listed earlier. First, the email looks legitimate, at least to the extent it betrays nothing suspicious to a typical bank customer (aka target-of-opportunity). The graphic appears to be a reasonable facsimile of a familiar logo, and the salutation and letter are what we might expect in this context.

Second, the target is the subset 
of recipients who are Bank of 
America customers. The fact that 
the majority of recipients are not 
customers is not a deterrent 
because there’s no penalty for 
over-phishing in Internet waters. 
Third, the request seems entirely 
reasonable and appropriate given 
the justification. We reason that if 
we were a bank, we might do 
something similar under such cir- 
cumstances. Fourth, the URL link 
seems to be appropriate to the 
brand. The unwary among us 
might readily trade off any linger- 
ing disbelief for the opportunity 
to correct what might be a simple 
error that could adversely affect 
use of a checking or credit card 
account. We may assume the link to verify.bofa.com would take us to an equally plausible Web form that would request an account name and password information.

[Figure 1a screenshot: Eudora inbox showing a message from “Bank of America” to “J. Smith” <jsmith@crlmail.i2.nscee.edu>, subject “B of A - Account Verification”:

Dear Bank of America Customer,

During our regular update and verification of accounts, we couldn’t verify your current information. It is possible that your information has changed or it is incomplete. Please login and verify your information when you sign in. If account information is missing, your account access could become restricted.

Please go to this site and verify your account information

http://verify.bofa.com/

Thank you,

Roger Jones
Senior Accounts Manager

** Please do not reply to this message as you will not receive a response]

[Figure 1b screenshot: a C:\WINDOWS\system32\cmd.exe console showing the harvested entry “UserID-> mjones” and its passcode.]

The unwary in this case is M. Jones, whose harvested Web form appears to the phisherman as in Figure 1b. This is a screenshot of an actual phishing server in my research lab.

Figure 1b. Phishing from the phisherman’s perspective.

In order to complete the scam 
the fifth condition must apply. In 


this case, after the private infor- 
mation is harvested, the circle is 
completed when the phishing 
server redirects the victim to the 
actual bank site. This has the 
effect of keeping the bank’s server 
logs roughly in line in case some- 
one makes an inquiry to the 
bank’s help desk. 

Figure 2 illustrates this activity. 
Of course, a more careful inspec- 
tion of the bank’s server logs 
would reveal a flaw in this simplified approach, because the phishing server shows up as the “referrer”, a telltale sign the phisher would like to avoid. But this deficiency could be overcome by a bit of careful packet crafting.

The preceding example is a 
well-known exploit strategy. Some 
sub-cerebral variations on this 
theme appear in the sidebar 
“Phishing Expeditions.” 
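
The referrer trail described above is something a bank could look for in its own server logs. The following is a minimal sketch, assuming an Apache-style "combined" log format and a hypothetical allow-list of the bank's own hosts (`TRUSTED_HOSTS`); the host names and sample log lines are invented for illustration and do not come from any real server.

```python
import re
from urllib.parse import urlsplit

# Hypothetical allow-list: hosts the bank itself controls (assumption).
TRUSTED_HOSTS = {"www.examplebank.com", "examplebank.com"}

# Assumed combined log format:
# client ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'^(?P<client>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "[^"]*"$'
)


def suspicious_referers(log_lines, trusted_hosts=TRUSTED_HOSTS):
    """Return (client, referer) pairs whose referer host is not trusted."""
    hits = []
    for line in log_lines:
        match = LOG_PATTERN.match(line.strip())
        if not match:
            continue  # skip lines that are not in the assumed format
        referer = match.group("referer")
        if referer in ("", "-"):
            continue  # no referer was sent, so there is nothing to compare
        host = urlsplit(referer).hostname
        if host is None or host.lower() not in trusted_hosts:
            hits.append((match.group("client"), referer))
    return hits


# Invented sample entries: the first arrives from an outside server,
# the second from the bank's own site.
SAMPLE_LOG = [
    '203.0.113.7 - - [03/Aug/2005:12:11:00 -0700] "GET /login HTTP/1.1" 200 5120 '
    '"http://192.0.2.55/verify/done.php" "Mozilla/4.0"',
    '198.51.100.9 - - [03/Aug/2005:12:12:00 -0700] "GET /login HTTP/1.1" 200 5120 '
    '"http://www.examplebank.com/home" "Mozilla/4.0"',
]

for client, referer in suspicious_referers(SAMPLE_LOG):
    print(client, "arrived from", referer)
```

As the column notes, a phisher who suppresses or forges the referrer defeats this check, so it is at best one signal among several.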


MONGERS AND MAYHEM 

So much for posers. My last 
example is a phish of a different 
stripe. So much so that it justifies 
discussion. It comes to us 
through an ISP in Shanghai in a 
cleverly disguised way. 

Look carefully at the cursor in 
Figure 3. The cursor seems to be 
sensing the link even though it’s 
not particularly close to it. The 
fact is it’s not sensing that link at 
all, but rather an image map. A 


Phishing Expeditions 


Example 1: from the 218.12. class B network registered to an ISP in Beijing. 

GIST OF EMAIL: A U.S. bank admits that its database has been hacked. The bank needs to 
have your bank debit card number, account ID, and PIN immediately. 

NOTABLE QUOTES: “This process is mandatory, and if you did [sic] not sign on within the 
nearest time [sic] your account may be subject to temporary suspension.” 

PHISHING LINE: http://218.12.29.40 

TARGET-OF-OPPORTUNITY: Someone grammatically challenged, especially with respect to tense and adjectives, who is both a fan of online banking and a newbie to the Web.

EFFECTIVENESS CRITERIA SCORE: 0.5 out of 5.

Example 2: from the 80.53 class B registered to an ISP in Gdansk. 

GIST OF EMAIL: An eCash service claims to have noticed attempts to log in to the user’s 
account from a foreign IP address. 

NOTABLE QUOTES: “...we have reasons to belive [sic] that your account was hijacked by a 
third party... If you choose to ignore our request, you leave us no choise [sic] but to tempo- 
raly [sic] suspend your account.” 

PHISHING LINE: click here

TARGET-OF-OPPORTUNITY: Submissive types who click on anything when told to do so and 
also subscribe to the school of relaxed orthography. 

EFFECTIVENESS CRITERIA SCORE: 1 out of 5 is charitable. 

Example 3: from an IP in Russia operating through a Web hosting service in Jordan.

GIST OF EMAIL: A well-known Web auction company’s Departament [sic] indicates that their 
records are out of date and that billing information must be updated within 24 hours or the 
account will be terminated. 

NOTABLE QUOTES: “Departament” is just a start. How about “Please update your records 
in maximum 24 hours.” 

PHISHING LINE: Please click here to update your billing records.

However, a look at the source page shows the actual URL is 

<a href=“http://darkcity.ru/acounts/memb/avncenter/dll87443/.BaylSAPI.dll/ 
hgdas676bsda6gwev7zfewfewf34gfwf23g235/13.4f3fg3f&bhdfahva68532hbhwseBaylSAPI.dllPay- 
mentLanding&ssPageName=hhpayUSf&=userhgads&secure&ssl7ravbd7dsb.html”>. What is 
the likelihood that our auction company works under the domain name of “darkcity” through 
an ISP in Russia? I don’t think so.

TARGET-OF-OPPORTUNITY: People who like lots of vowels who don’t know how to use the 
“view source” menu option in their browser. 

EFFECTIVENESS CRITERIA SCORE: 1.5 out of 5 seems generous to me. 
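
The "view source" check in Example 3 can be automated. The sketch below uses Python's standard `html.parser` to list links whose real destination lies outside the domain the message claims to come from. The function name `link_warnings`, the sample markup, and the `example.com` / `example.net` hosts are hypothetical stand-ins, not taken from the captured email.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a> element with an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def link_warnings(html, expected_domain):
    """Return (visible text, actual host) for links that leave expected_domain."""
    collector = LinkCollector()
    collector.feed(html)
    warnings = []
    for href, text in collector.links:
        host = (urlsplit(href).hostname or "").lower()
        if host != expected_domain and not host.endswith("." + expected_domain):
            warnings.append((text, host))
    return warnings


# Invented sample: one link leaves the claimed domain, one stays inside it.
SAMPLE_EMAIL = (
    '<p><a href="http://phish.example.net/acounts/update.html">'
    "Please click here to update your billing records.</a></p>"
    '<p><a href="https://signin.example.com/">Sign in</a></p>'
)

for text, host in link_warnings(SAMPLE_EMAIL, "example.com"):
    print(repr(text), "actually points to", host)
```

This only compares host names; it says nothing about whether the claimed domain itself is legitimate.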


quick review of the source code, shown beneath the figure, leads us to a veritable cornucopia of trickery.

Several features make Figure 3 and its associated source code interesting. First, the image map coordinates take up pretty much the whole page. Second, the image that is mapped is the actual text of the email. So what appeared to be email was just a picture of email. Thus, the redirect was actually not a secure connection to eBay at all as it appeared, but an insecure connection to 218.1.XXX.YYY/

Digital Village



Dear eBay Member, 

We regret to inform you that your eBay account could be suspended if you don't re-update your 
account information. 


To resolve this problem please visit link below and re-enter your account information. 


https://signin.ebay.com/ws/eBayISAPI.dll?Signin&sid=verify&co_partnerId=2&siteid=0


If your problems could not be resolved your account will be suspended for a period of 24 hours, 
after this period your account will be terminated. 


For the User Agreement, Section 9, we may immediately issue a warning, temporarily suspend, 
indefinitely suspend or terminate your membership and refuse to provide our services to you if 
we believe that your actions may cause financial loss or legal liability for you, our users or us. 
We may also take these actions if we are unable to verify or authenticate any information you 
provide to us. 


Due to the suspension of this account, please be advised you are prohibited from using eBay in 
any way. This includes the registering of a new account. Please note that this suspension does 
not relieve you of your agreed-upon obligation to pay any fees you may owe to eBay. 


Regards, 

Safeharbor Department eBay, Inc 

The eBay team 

This is an automatic message, please do not reply 


Figure 3. Phish mongering. 


Source Code for Figure 3 


<x-html>
<html><p><font face="Arial"><A HREF="https://signin.ebay.com/ws/eBayISAPI.dll?Signin&sid=verify&co_partnerId=2&siteid=0"><map name="xlhjiwb"><area coords="0, 0, 646, 569" shape="rect" href="http://218.1.XXX.YYY/.../e3b/"></map><img SRC="cid:part1.04050500.04030901@support_id_314202457@ebay.com" border="0" usemap="#xlhjiwb"></A></a></font></p><p><font color="#FFFFF3">Barbie Harley Davidson in 1803 in 1951 AVI
</x-html>
Figure 2. Phish clean up.


.../e3b/. While Windows users 
see the “dots of laziness” fre- 
quently when a path expression 
is too long for the path pane in 
some window, this isn’t a Win- 
dows path in a path pane. These 
“dots of laziness” are a directory 
name. Now why would one cre-
ate a directory named “...”? It
certainly falls short of the 
mnemonic requirements most of 
us learned in introductory pro- 
gramming courses. 

On the other hand, it might 
blend in stealthily with the other 
Unix/Linux hidden files “.” and 
“..” and possibly escape an 
onlooker’s suspicion. This sug- 
gests the computer at the end of 
218.1.XXX.YYY may not be the 
phisher at all, but another unsus- 
pecting victim whose computer 
has been compromised (for that 
reason, I’ve concealed the final 
two octets of the IP address). 
Another sign of intrigue is the 
font color of almost pure white 
“#FFFFF3” for “Barbie Harley 
Davidson in 1803 in 1951 AVI.” 
Though their names are sullied, 
neither Barbie nor Harley David- 
son had anything to do with this 
scam. This white-on-white hid- 
den text is there to throw off the 
Bayesian analyzers in spam filters. 
Note that the email text is actu-
ally a graphic, so the Bayesian
analysis likely concludes that this
is about Barbie and her Harley.
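
Two of the tricks in the Figure 3 source, an image-map area that points at a raw IP address and filler text in a near-white font, are easy to flag mechanically. The sketch below is one way to do so with the standard library. The helper names and the 0xF0 brightness threshold are assumptions chosen for illustration, and the sample uses the documentation address 192.0.2.10 in place of the concealed host.

```python
import ipaddress
from html.parser import HTMLParser
from urllib.parse import urlsplit


def is_ip_host(url):
    """True when the URL's host is a literal IP address rather than a name."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def is_near_white(color, threshold=0xF0):
    """True for #RRGGBB colors whose channels are all at or above threshold."""
    if not (color.startswith("#") and len(color) == 7):
        return False
    try:
        channels = [int(color[i:i + 2], 16) for i in (1, 3, 5)]
    except ValueError:
        return False
    return all(channel >= threshold for channel in channels)


class TrickFinder(HTMLParser):
    """Record image-map areas aimed at raw IPs and near-white font colors."""

    def __init__(self):
        super().__init__()
        self.findings = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # tag and attribute names arrive lowercased
        href = attrs.get("href")
        if tag == "area" and href and is_ip_host(href):
            self.findings.append(("image map points at raw IP", href))
        color = attrs.get("color")
        if tag == "font" and color and is_near_white(color):
            self.findings.append(("near-white text", color))


def find_tricks(html):
    finder = TrickFinder()
    finder.feed(html)
    return finder.findings


# Simplified stand-in for the Figure 3 markup (hosts are placeholders).
SAMPLE_SOURCE = (
    '<a href="https://signin.example.com/"><map name="m">'
    '<area coords="0, 0, 646, 569" shape="rect" href="http://192.0.2.10/.../e3b/">'
    '</map><img src="cid:part1" border="0" usemap="#m"></a>'
    '<p><font color="#FFFFF3">Barbie Harley Davidson</font></p>'
)

for kind, detail in find_tricks(SAMPLE_SOURCE):
    print(kind, "->", detail)
```

The near-white test assumes a white background; a fuller check would compare the text color against the actual background color.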

As opposed to the posers, this 
phish monger is moderately 
clever. While the exploit may not 
earn a trophy, it’s a keeper. 


URL Pearls 


The best starting point for phishing awareness is the Antiphishing Web site (www.antiphishing.org). This site contains useful statistics, charts, events lists, and archives of virtually all aspects of phishing collected by third-party sources. Not surprisingly, the site’s bar charts reveal that phishing is on the rise. The statistics are unweighted, so the impact of the major offending nations tends to be underrepresented. More meaningful measures might include attacks as a percentage of registered IP addresses, as a percentage of Internet traffic, and so forth.


Fraudwatch International’s Web site (www.fraudwatchinternational.com) is a good place 
to start investigation of phishing scams. It lists current alerts for both phishing and non- 
phishing scams, Internet fraud, and identity theft. On a busy day, their Web site produces a 
dozen or more new phishing alerts. Since phishing is propagated primarily through email, 
the lifespan of a phish scam is measured in days before it is included in malware and firewall 
updates. Consequently, only the most recent reports represent those variations on the theme 
that are the most dangerous. As of this writing, there were 22,821 individual phishing 
exploits being monitored by Fraudwatch, 590 of which are currently active. Fraudwatch also 
offers Fraudshield, a phish-filtering utility that works much like an anti-virus program. 

For a historical perspective on email, see my April 1997 column “Email: The Good, the Bad, and the Ugly.”


Conclusion 

It is unfortunate in the extreme 
that there are victims who fall for 
the fatuous phishing scams. We 
would all sleep better if the kind 
of flagrant errors characterized by 
our posers automatically ruled 
them out of all consideration. 
But they don’t, unfortunately.
Unlike cracking and the business 
of script kiddying, phishing is 
economically motivated: if it 
weren't profitable for these cyber 
crooks to phish, they wouldn't do 
it. And if the posers are occasionally effective, it’s no wonder that the mongers account for economic losses in the billions of dollars each year, losses that are ultimately borne by the customers.

All four examples, even those 
written by the posers, managed to 
escape detection by one of my 
spam/phish filters within the last 


few months. The likelihood is that 
future phishing, or whatever 
phoolware follows it, will continue 
the cat-and-mouse game with secu- 
rity software. Perhaps our greatest 
mistake is excessive reliance on 
technology solutions. Our efforts 
seem no more effective at blocking 
phish scams now than they were at 


blocking embedded executables 10 
years ago. 


When it comes to email, common sense still goes a long way.


HAL BERGHEL (www.berghel.net) is associate dean of the Howard R. Hughes College of Engineering and director of the Center for Cybermedia Research at the University of Nevada, Las Vegas, and co-director of the Identity Theft and Financial Fraud Research and Operations Center.
THE HISTORY OF ACM


As the 60th anniversary
of ACM approaches,
Communications is planning a 
special section next year on 
the history of the organization, 
its activities, and its role in the 
development of computing. 
The guest editors of this section 
are particularly interested in 
articles from historians of 
computing. 

Articles should be about 3,000 
words in length and should be 
submitted to David S. Wise 
(dswise@cs.indiana.edu) by 
May 16, 2006. Authors will be 
notified of a preliminary 
decision in October 2006 and 
will have until December 15 to 
submit a revised version for 
final consideration. 
Voter Database Report


Statewide Databases of Registered Voters


A study of accuracy, privacy, usability, security, and reliability issues. 


ACM’s U.S. Public Policy
Committee (USACM)
formed a committee last
year to provide states with guid- 
ance on implementing the 
statewide voter-registration data- 
bases mandated by the Help 
America Vote Act, a federal law 
passed in the wake of the contro- 
versial 2000 Presidential Elec- 
tions. The committee recently 
released a report on its study; for 
the complete version please visit 
www.acm.org/usacm/VRD. 


EXECUTIVE SUMMARY 

The voter registration process 
may seem simple to most voters. 
They give their names, addresses, 
birth dates, and in some cases
party affiliations to election offi- 
cials with the expectation that 
they will be able to vote on Elec- 
tion Day. In reality, election offi- 
cials must oversee a complex 
system managing this process. 
They must ensure that the vot- 
ers’ information is accurately 
recorded and maintained, that 
the system is transparent while 
voter information is kept private 
and secure from unauthorized 
access, and that poll workers can 
access this information on Election Day to determine whether
or not any given voter is eligible. 
A well-managed voter registra- 
tion system is vital for ensuring 
public confidence in elections. 

State and local governments 
have managed voter registration 
using different approaches among 
different jurisdictions. In 2002, 
Congress sought to make these 
disparate efforts more uniform by 
passing the Help America Vote 
Act, which required that each 
state have a computerized 
statewide voter registration data- 
base. In implementing this man- 
date, state and local governments 
still have differing approaches, 
but it is clear that information 
technology underpins each of 
their efforts. While technology 
will help election officials manage 
this complex system, it also cre- 
ates new risks that must be 
addressed. 

This study focuses on five areas 
that election officials should 
address when creating statewide 
voter registration databases 
(VRDs): accuracy, privacy, usabil- 
ity, security, and reliability. Each 
chapter contains detailed discus- 
sions and recommendations. The 
following are some of the overarching goals for VRDs and
selected recommendations for 
achieving them. 


1. The policies and practices of 
entire voting registration sys- 
tems, including those that gov- 
ern VRDs, should be 
transparent both internally and 
externally. 

VRDs control access to voting; 
therefore, they have a direct 
impact on the fairness of elec- 
tions, as well as the public’s per- 
ception of fairness. It must be 
possible to convince voters, politi- 
cal parties, politicians, academics, 
the press, and others that VRDs 
are correct and are operating 
appropriately. Internal procedures 
and interfaces also must be clear 
to election workers in order to 
minimize errors. Transparency 
can be provided by allowing vot- 
ers to verify their voter registra- 
tion status and data; publicly 
disclosing outside data sources 
that officials use for verification; 
indefinitely keeping a secure 
write-once VRD archive in elec- 
tronic form to allow audits of pre- 
vious elections; and using 
independent experts to audit and 
review VRD security policies. 


Other goals such as accountability, 
audits, and notification also sup- 
port transparency and are dis- 
cussed here. 


2. Accountability should be 
apparent throughout each VRD. 

It should be clear who is 
proposing, making, or approving 
changes to the data, the system, or 
its policies. Security policies are an 
important tool for ensuring 
accountability. For example, access 
control policies can be structured 
to restrict actions of certain 
groups or individual users of the 
system. Further, users’ actions can 
be logged using audit trails (dis- 
cussed here). Accountability also 
should extend to external uses of 
VRD data. For example, state and 
local officials should require recip- 
ients of data from VRDs to sign 
use agreements consistent with 
the government’s official policies
and procedures. 


3. Audit trails should be 
employed throughout the VRD. 
VRDs that can be independently verified, checked, and proven to be fair will increase voter confidence and help avoid litigation. Audit trails are important for independent verification,
which, in turn, makes the system 
more transparent and provides a 
mechanism for accountability. 
They should include records of 
data changes, configuration 
changes, security policy changes, 
and database design changes. The 
trails may be independent records 
for each part of the VRD, but 
they should include both who 
made the change and who 
approved the change. 
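
One possible shape for such an audit record is sketched below. Each entry stores what changed, who made the change, and who approved it. The hash chaining that links each entry to its predecessor is an added assumption (a common way to make tampering evident) and is not something the report prescribes. All class, field, and user names are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, field

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(record):
    """SHA-256 over a canonical JSON encoding of the record."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass
class AuditTrail:
    """Append-only list of change records, each linked to the one before it."""

    entries: list = field(default_factory=list)

    def append(self, change, made_by, approved_by):
        prev_hash = self.entries[-1]["hash"] if self.entries else GENESIS_HASH
        record = {
            "change": change,
            "made_by": made_by,          # who made the change
            "approved_by": approved_by,  # who approved the change
            "prev_hash": prev_hash,
        }
        record["hash"] = _digest(record)  # hash covers the four fields above
        self.entries.append(record)
        return record

    def verify(self):
        """Return True if no stored entry has been altered or reordered."""
        prev_hash = GENESIS_HASH
        for record in self.entries:
            body = {k: v for k, v in record.items() if k != "hash"}
            if record["prev_hash"] != prev_hash or record["hash"] != _digest(body):
                return False
            prev_hash = record["hash"]
        return True


trail = AuditTrail()
trail.append("data: address updated for one voter record", "clerk_a", "supervisor_b")
trail.append("config: nightly backup window changed", "admin_c", "supervisor_b")
print("trail intact:", trail.verify())
```

A production system would also need trustworthy timestamps and write-once storage, as the report's archive recommendation suggests; this sketch covers only the record structure.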


4. Privacy values should be a
fundamental part of the VRD, 
not an afterthought. 

Privacy policies for voter regis- 
tration activities should be based 
on Fair Information Practices 
(FIPs), which are a set of princi- 
ples for addressing concerns about 
information privacy. FIPs typically 
address collection limitation, data 
quality, purpose specification, use 
limitation, security safeguards, 
openness, individual participation, 
and accountability. There are 
many ways to implement good 
privacy policies. For example, we 
recommend that government both 
limit collection to only the data 
required for proper registration 
and explain why each piece of 


personal information is necessary. 
Further, privacy policies should be 
published and widely distributed, 
and the public should be given an 
opportunity to comment on any 
changes. 


5. Registration systems should 
have strong notification policies. 

Voters should be informed 
about their status, election infor- 
mation, privacy policies of the 
government, and security issues. 
As with audit trails, notification 
procedures can improve trans- 
parency; however, they are not 
always widely embraced. A recent 
survey found that approximately 
two-thirds of surveyed states do 
not notify voters who have been 
purged from election rolls. Voters 
should be notified by mail about 
their polling places, any changes 
that may affect their ability to 
vote, or any security breaches that 
expose private data. 


6. Election officials should rigor- 
ously test the usability, security, 
and reliability of VRDs while 
they are being designed and 
while they are in use. 

Testing is a critical tool that can reveal that “real-world” poll workers find interfaces confusing and unusable, expose security flaws in the system, or show that the system is likely to fail under the stress of Election Day. All of these
issues, if caught before they are 
problems through testing, will 
reduce voter fraud and the disen- 
franchisement of legitimate voters. 
We recommend many different 
ways to test various aspects of 
VRDs throughout the report. 
Examples include evaluation of 
VRD interfaces by laypersons and 
experts for consistency, feedback, 
and error handling; testing inter- 
faces with real-world users and 
conditions, including extreme or 
sub-optimal conditions such as 
high processor load or network 
congestion; and allowing thor- 
ough, independent evaluations of 
the security and reliability of the 
VRD. 


7. Election officials should 
develop strategies for coping 
with potential Election Day fail- 
ures of electronic registration 
databases. 

VRDs are complex systems. It 
is likely that one or more aspects 
of the technology will fail at some 
point. Different strategies can be 
employed to adjust for various 
failures. For example, Election 
Day verifications can be done via 
any of the following: paper sys- 
tems, personal computers or 
hand-held devices with DVD- 
ROMs or other methods of hold- 
ing static copies of the voter list, 
or via personal computers or 
hand-held devices connected by 
electronic communication links to 


central VRDs. Regardless of the 
method used, a fallback process 
should be devised to deal with a 
VRD failure. When appropriate, 
these processes should operate in 
tandem with provisional balloting 
and other measures designed to 
protect the voters’ right to vote. 


8. Election officials should 
develop special procedures and 
protections to handle large-scale 
merges with and purges of the 
VRD. 

One of HAVA’s main require- 
ments is that VRDs be coordi- 
nated with other state databases 
(such as motor vehicle records). 
Ensuring that voter records reflect 
up-to-date information from 
other databases can improve the 
accuracy of VRDs, but coordina- 
tion can introduce errors from the 
same databases, thereby under- 
mining accuracy. Because large- 
scale merges and purges can 
render voters ineligible, the action 
should only be performed by a 
senior election official with proce- 
dures that force some sort of man- 
ual review of the changes. Further, 
if large-scale purges occur, they 
should be done well in advance of 
any election, and anyone purged 
from the database should receive 
notification so that any errors can 
be corrected. 


Conclusion. State and local 
election officials face an ongoing 
and challenging task in creating 
and implementing statewide voter 
registration databases. We hope 
that the discussion and recommendations in this report will help inform officials and the public on how to meet these challenges.

In issuing this report, we recog- 
nize that many states have been 
working diligently toward meet- 
ing the federal requirement to 
have an operational statewide 
VRD. Both because many states 
will not meet this deadline, and 
because there will be ongoing 
maintenance and changes to any 
such system, state and local gov- 
ernments will also face the issues 
identified in this report well 
beyond the federal deadline. For 
this reason, we offer our contin- 
ued guidance to officials who may 
wish to discuss any of the topics 
raised in this report.


Members of the ACM Committee 
on Guidelines for Implementation 
of Voter Registration Databases 


PAULA HAWTHORN (retired database company executive), Co-chair of Study
BARBARA SIMONS (retired, IBM Research and former ACM president), Co-chair of Study
CHRIS CLIFTON (Computer Science, Purdue)
DAVID WAGNER (Electrical Engineering and Computer Science, UC Berkeley)
STEVEN M. BELLOVIN (Computer Science, Columbia)
REBECCA N. WRIGHT (Computer Science, Stevens Institute of Technology)
ARNON ROSENTHAL (Research Scientist, MITRE Corporation)
RALPH SPENCER POORE (Consultant, Privacy and Security)
LILLIE CONEY (Associate Director, Electronic Privacy Information Center)
ROBERT GELLMAN (privacy and security consultant)
HARRY HOCHHEISER (Computer Professionals for Social Responsibility)


© 2006 ACM 0001-0782/06/0400 $5.00 


Hot Links


Top 10 Downloads from ACM's Digital Library


THE TOP 10 MOST POPULAR PAPERS FROM ACM’S REFEREED JOURNALS AND CONFERENCE PROCEEDINGS DOWNLOADED IN JANUARY 2006

1  3  “The Google File System.” Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. Proceedings of the 19th ACM Symposium on Operating Systems Principles. Oct. 2003.
2  2  “Architectures for a Temporal Workflow Management System.” Carlo Combi, Giuseppe Pozzi. Proceedings of the 2004 ACM Symposium on Applied Computing. Mar. 2004.
3  4  “dbSwitch.” Shaul Dar, Gil Hecht, Eden Shochat. Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data. June 2004.
4  391  “Evolution of Web Site Design Patterns.” Melody Y. Ivory, Rodrick Megraw. ACM Transactions on Information Systems. Oct. 2005.
5  76  “End-to-End Arguments in System Design.” J.H. Saltzer, D.P. Reed, D.D. Clark. ACM Transactions on Computer Systems. Nov. 1984.
6*  708  “A Strongly Polynomial-time Algorithm for Over-Constraint Resolution.” Ali Dasdan. Proceedings of the 10th International Symposium on Hardware/Software Codesign. May 2002.
[Two entries are illegible in the source; the surviving fragments read “... Educational Resource ...” and “... based Signal Process ... 11th International ...”]
9  8  “Object-Oriented Programming, Tutorial.” Manuel Alfonseca. Conference Proceedings on APL 90: For the Future. May 1990.
10  93  “Invariants.” Bengt Ahlgren, Marcus Brunner, Lars Eggert, Robert Hancock, Stefan Schmid. Proceedings of the ACM SIGCOMM Workshop on Future Directions in Network Architecture. Aug. 2004.

*tied for total number of downloads.

THE 10 MOST POPULAR COURSES AT THE ACM PROFESSIONAL DEVELOPMENT CENTRE FOR JANUARY 2006

1  1  Java 2 Programming for SDK 1.4 - Part 1: Language Fundamentals
2  2  OOAD: Introduction to Object-Oriented Concepts
3  4  C++ Programming - Part 1
4  5  Enterprise Connectivity with J2EE
5  3  C Programming - Part 1
6  6  C# Programming for the Microsoft .NET Platform - Part 1: Introduction to C#
7  13  Time Management: Developing a Plan
8  32  HTML 4.01 - Part 1: Fundamentals
9  42  Macromedia Dreamweaver MX [...] Dreamweaver Basics
10  7  OOAD: Introduction to Object-Oriented Analysis and Design

Statistics for these tables were compiled Feb 10, 2006.


COMMUNICATIONS OF THE ACM April 2006/Vol. 49, No. 4 29 




Top 10 Most POPULAR MAGAZINE AND COMPUTING SURVEYS ARTICLES 
DOWNLOADED IN JANUARY 2006 


| 6 “The Profession of IT: Who are We? eset Denning. Communications of the ACM. Feb. 2001. 


2 20 ogramming,” Caitlin Kelleher, Randy Pausch. ACM Computing Sur 
3 2 “Data Clustering.” A.K. Jain, M.N. Murty, PJ. Flynn. ACM Computing Surveys. Sept. 1999. 
4 ao “The Five Orders of Ignorance.” Phillip G. Armour. Communications of the ACM. Oct. 2000. 
> | “A Survey of Peer-to-Peer Content Distribution Technologies.” Stephanos Androutsellis-Theotokis, 
Diomidis Spinellis. ACM Computing Surveys. Dec. 2004. 
6 3 “Face Recognition.” W. Zhao, R. Chellappa, PJ. Phillips, A. Rosenfeld. ACM Computing Surveys. Dec. 2003. 
7 1 “Topology Control in Wireless Ad Hoc and Sensor Networks.” Paolo Santi. ACM Computing Surveys. June 2005. 
8 54 “Interface Design for Mobile Commerce.” Young Eun Lee, Izak Benbasat. Communications of the ACM. Dec. 2003. 
9 5 “Aspect-Oriented Programming.” Tzilla Elrad, Robert E. Filman, Atef Bader. Communications of the ACM. 
Oct. 2001. 
10 4 “Bioinformatics—An Introduction for Computer Scientists.” Jacques Cohen. ACM Computing Surveys. June 2004.


This list does not include downloads for the December and January issues as those statistics tend to reflect e-subscribers downloading current issues.


Coming Next Month in 
May

TWO DECADES OF THE LANGUAGE-ACTION PERSPECTIVE 

The language-action perspective (LAP) for design considers software design from the point of view of
how humans create actions that software can support. LAP has significantly influenced artificial intelligence,
computer-supported cooperative work, interface design, information system design, interorganizational
communication, and multiagent systems. This section will examine the wide range of effects LAP has had on
IT over the last 20 years, emphasizing its practical value and what it means for computing professionals
given that systems supporting communication and business process integration now play pivotal roles
inside and between organizations.


Also in May: 
• Adaptive Web Information Extraction • How UML is Used • Business Technology Strategy in the Early 21st
Century • The Web of System Performance • Is Information Systems a Reference Discipline?
President's Letter David A. Patterson


Seven Reasons to Shave Your Head and Three Reasons Not to: The Bald Truth


Many of these monthly missives have addressed the different ways ACM 
is working to change the image of computing professionals. This month, 
in the spirit of April Fool’s Day, we root out the ACM President’s effort 
to change that image one person at a time—starting at the top. 


In 2003, after years of watching my skin triumph
over my hair for dominance of my scalp, I surrendered
by removing all surviving strands. In extensive
discussions, I found that we of the shiny head
fraternity have many experiences in common. Based
on this collected wisdom, I offer a fair and unbiased
analysis of the positives and negatives of baldhood
from a man’s perspective. While I believe many of
these points also apply to women, we won’t be sure
until we see Sinead O’Connor’s “Forum” letter.

Let’s look at the negative first, with Three Reasons 
Not to Shave Your Head: 

1. You have a full head of naturally colored hair. 

Male-pattern baldness and age-related graying limit 
this reason, so enjoy it while you can. 

2. There is more to sunburn in summer and more 
to chill in winter. 

Hair is Mother Nature's furry little hat, so removing 
it implies wearing a real hat or adding sunscreen. The 
good news is that it just takes seconds to add sunscreen 
to a hairless scalp. 

3. Initially, you may bump your head. 

Since you don't have eyes in the back of your head, 
your hair’s sense of touch is your head’s early warning 
system. Until you stop relying on it, your head may 
take a few lumps. Note that in a collision-rich environment,
a hat can play that warning role.

Figure 1. ACM's President (a) before (b) after.

Figure 2. The bald and the brave. (a) Bruce Willis, (b) Sean
Connery, and (c) Indy meets his match (temporarily) in a scene
from Indiana Jones and the Raiders of the Lost Ark.

On the positive side, here are Seven Reasons to 
Shave Your Head: 

1. Women think you're younger. 

The amount and color of hair is a primary indicator 
of a man’s age, and the bald cut is popular with young 
men. Thus, you both remove the major age clue and 
adopt a cool, young haircut. The average estimate from 
female friends is that I shaved 10 years off my apparent 
age (see Figure 1). 

2. Men think you look tougher. 

A story makes this point. A fellow player at my 
weekly soccer match who didn’t recognize me complained
aloud, “what idiot invited the hard ass.” That
remark made my day. 

Bald actors are often cast in macho roles, like Bruce 
Willis and Sean Connery. Movie tough guys are often 
bald. (Remember the bald German mechanic who outboxed
Indiana Jones until the propeller stopped the
match?) Such virile film images shape society’s view of
baldies (see Figure 2).

3. You never need to comb your hair. Ever. 

It doesn’t matter if you drive a convertible, wear a 
motorcycle helmet, walk in the rain, or even pull a 
sweater over your head; there are no more bad hair 
days. 

4. It’s the only good-looking, do-it-yourself haircut.

You save the hassle and cost of going to the barber 
and buying men’s hair products. Moreover, most alter- 
natives are less manly: hair dye, comb-overs, implants, 
wigs, baldness cures, and so on. 

5. Shaving technology has improved. 

State-of-the-art shaving tools have rapidly advanced 
in the last several years. Triple-, quadruple- 
and even quintuple-bladed razors give a close 
shave while making it unlikely to nick your 
scalp, and it appears impossible to cut yourself 
with a cordless electric shaver. This was 
important to me because I don’t want to go to 
work with little bits of toilet paper stuck to 
my head. 

6. Showers are quicker, so is grooming. 
It might surprise you to discover how
much shower time is spent on hair: washing,
conditioning, drying, and combing. Without 
hair, showers take just a minute or two. 
Hence, the bald cut saves time in your daily routine. 
Moreover, you don’t need to shave your head every day,
since the hair on your head grows more slowly than the 
hair on your face. Oddly enough, some experts think 
the hair on your head grows more slowly in some sea- 
sons than in others [1, 2]. 

7. Finally, no worries about hair loss. 

For the first time in a man’s life, you look forward to 
getting balder, as it saves time!
Viewpoint Shlomo Argamon and Mark Olsen


Toward Meaningful Computing


Computer science can revolutionize the humanities as it has the hard sciences 
by revealing and contextualizing the meaning hidden in digital libraries. 


Current initiatives by Google, Yahoo, and a consortium
of European research institutions to digitize
the holdings of major research libraries worldwide
promise to make the world’s knowledge
accessible as never before.
Yet in order to completely realize this promise, 
computer scientists must still develop systems that 
deal effectively with meaning, not just with data 
and information. This grand research and develop- 
ment challenge motivates our call here to improve 
collaboration between computer scientists and 
scholars in the humanities. 

Research in the humanities (such as historical lex- 
icography and textual criticism) focuses on the inter- 
pretation and elucidation of what is meant by a 
particular text or artifact in its context of origin, 
expression, and reception. As with recent computa- 
tional collaborations with biologists and other physi- 
cal scientists that have revolutionized these fields and 
spawned subindustries of computer science (such as 
bioinformatics and physical simulation), collabora- 
tion between computer scientists and humanities 
scholars promises to improve and deepen everyone's 
connection to and understanding of the emerging 
global universe of knowledge. 

Computer science researchers and practitioners
should jump into this challenge, establishing collaborations
with humanities scholars, industry leaders,
and programmers and IT experts. Moreover,
research and funding agendas should be realigned to 
include support for the development of humanities 
computing at all levels (such as in tools for dealing 
with highly disparate writing systems or text-naviga- 
tion systems). 

To deal with meaning, one must deal with the 
context of information in the broadest sense. 
Although dealing with simple question answering or 
text retrieval might ignore the distinction between 
meaning and information, more complex tasks 
require approaches that are more aware of the con- 
text of a given text. Context includes the text’s rela- 
tion to other texts, its authors and editors, where 
and under what circumstances it was written and 
published, and even who has read it and what they 
may have had to say about it. What, for example, 
might be the implications of enhancing Wikipedia 
with information linking articles with metadata 
about their authors and editors, even their readers 
and reviewers? For one thing, users would have 
social access to Wikipedia's version histories and thus 
a more complete picture of the development of 
knowledge in Wikipedia. The technical hurdles for 
even this straightforward application are nontrivial, 
demanding careful user-interface design and algo- 
rithmic support. 

Computer scientists must involve humanities 
scholars in such development, bringing to bear their 
centuries of experience making explicit the invisible 
but crucial connections among texts, people, and 
social environments. The great power of the Web is
derived in part from the ubiquitous hyperlinks con- 
necting its millions of pages and sites, so parts of its 
knowledge structure become explicit. But most of 
the important connections remain implicit, and 
many more implicit and inaccessible structures will 
need to be teased out as the world’s libraries come 
online. Thus, a fundamental task to be addressed by 
a new humanities/computing collaboration is the 
identification and explication of implicit links, along 
with development of effective interaction modes—a 
new generation of “knowledge browsers.” 

Over the next few years, this effort may involve
identifying and exploiting connections among similar
significant phrases or structures in millions of different
documents or using manually assigned
metadata (in a Semantic Web [1] sense) to identify
connections among documents. Such markup is
indeed important; much effort in humanities computing
today is devoted to developing XML markup
schemes for the structural and semantic properties of
texts. For example, the Text Encoding Initiative
(www.tei-c.org) is developing many encoded data
sets, ranging from the works of individual authors to
massive collections of national, historical, and cultural
literatures [2]. However, many of the formal
assumptions underlying XML markup (such as the
idea that documents are essentially trees) are insufficient
for representing truly useful markup (many
types of textual structures may overlap). Hence,
future advances toward making the meaning in digital
literature globally accessible depend on the development
of richer structural markup schemes that
also allow for efficient search and retrieval.

The complex web of relationships within and
among texts and their contexts leads to novel problems
of searching and analysis (such as how to represent
complex social networks involving individuals
with a variety of attributes, like age, sex, and economic
class, and with different roles in the literature, such as
authors, readers, and fictional characters). Nonsocial
networks of interest also abound, ranging from the
relatively simple (such as scene locations in a play) to
the relatively complex (such as thematic units shared
in the texts of a particular literature). As we develop
such contextually aware search tools, other techniques
of computational linguistics, image processing, and
social network analysis must also be brought to bear,
aiming to find yet more complex and higher-level
connections among “idea units” in media.
The development of such tools for contextually
aware searching and information processing may rev- 
olutionize humanities scholarship. Scholars will be 
able to interact with texts in new ways, particularly as 
linguistic, visual, and statistical processing give them 
new ways of reading, visualizing, and understanding 
them. 
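
To make the earlier point about overlapping textual structures concrete, here is a minimal Python sketch of stand-off annotation, one common alternative to inline tree markup, in which each structure is stored as a labeled character range over the text rather than as a nested element. The `Span` class, the helper names, the sample text, and the annotations are all hypothetical illustrations and are not part of TEI or any existing toolkit. The sketch simply reports pairs of spans that cross (they intersect, but neither contains the other), which is the case a single XML tree cannot express directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A labeled character range [start, end) over a text (stand-off markup)."""
    label: str
    start: int
    end: int


def span_of(text, label, fragment):
    """Build a Span covering the first occurrence of `fragment` in `text`."""
    start = text.index(fragment)
    return Span(label, start, start + len(fragment))


def crosses(a, b):
    """True if a and b intersect but neither one contains the other."""
    intersect = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return intersect and not (a_in_b or b_in_a)


def crossing_pairs(spans):
    """List the label pairs that could not be nested inside one tree."""
    return [
        (a.label, b.label)
        for i, a in enumerate(spans)
        for b in spans[i + 1:]
        if crosses(a, b)
    ]


# Hypothetical example: two verse lines and a sentence that runs across
# the line break, plus a phrase that nests cleanly inside the second line.
text = "one two three four five six"
spans = [
    span_of(text, "line-1", "one two three"),
    span_of(text, "line-2", "four five six"),
    span_of(text, "sentence", "two three four"),
    span_of(text, "phrase", "five six"),
]

pairs = crossing_pairs(spans)
print(pairs)
```

Here the sentence crosses both verse lines, so the verse hierarchy and the sentence hierarchy cannot share one tree, while the phrase nests without trouble. Keeping annotations outside the text is only one possible design; it trades the simplicity of inline tags for the freedom to layer any number of independent structures.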

The converse—using computational methods to 
develop new modes of artistic and scholarly expres- 
sion—is also key. Hypertext, blogging, and new 
forms of digital media and gaming are all being 
applied to humanities research and teaching. These 
emerging networked interactive media represent only 
the tip of the iceberg. 

The emerging global digital library will thus 
require computer scientists to develop new query 
interfaces that represent the textual and contextual 
elements of human meaning. Occurrences of terms 
in documents may be filtered by rhetorical or linguis- 
tic context (such as when a word is used in syntacti- 
cally, semantically, or thematically salient positions). 
Visualization tools must include relevant documents
on timelines, geographic representation, author characteristics,
or degree of connection to other documents.
Meanwhile, the contingency and ambiguity of
meaning in human texts pose problems of how to 
present the content and context of text to the user in 
a comprehensible and manipulable fashion. 
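
As a rough sketch of filtering term occurrences by context, the Python fragment below keeps only those occurrences that fall in the opening sentence of a paragraph. Treating the opening sentence as a "thematically salient position" is an assumption made here purely for illustration; the function name, the simple sentence-splitting rule, and the sample text are likewise hypothetical, and a real system would draw on much richer linguistic analysis.

```python
import re


def salient_occurrences(document, term):
    """Return (paragraph_index, sentence) pairs where `term` appears in the
    first sentence of a paragraph. The first sentence is used as a crude,
    assumed stand-in for a thematically salient position."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    hits = []
    for index, paragraph in enumerate(paragraphs):
        # Naive sentence split: break after ., !, or ? followed by whitespace.
        first_sentence = re.split(r"(?<=[.!?])\s+", paragraph, maxsplit=1)[0]
        if pattern.search(first_sentence):
            hits.append((index, first_sentence))
    return hits


# Hypothetical sample text: the term opens the first paragraph but is
# buried in the second sentence of the second paragraph.
sample = (
    "Liberty was the theme of the pamphlet. It sold widely.\n\n"
    "The printer was cautious. He feared that liberty would offend the censor."
)

hits = salient_occurrences(sample, "liberty")
print(hits)
```

Only the first paragraph is returned, because the second mentions the term outside its opening sentence. Swapping in a different notion of salience (a syntactic subject position, say, or a chapter title) would change only the rule that selects the candidate text.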

In our expanding information universe, computer 
scientists must ensure that access enhances and 
enriches everyone’s meaningful experience with 
information, rather than dehumanizes it by possibly 
omitting its context. They must join hands with 
humanists, whose concern with explicating meaning 
in all its complexity crosses disciplinary boundaries, 
from literature and philosophy to history and music. 

Such collaboration begins with humanist organi- 
zations, including the Association for Computing in 
the Humanities (www.ach.org) and the Association 
for Literary and Linguistic Computing 
(www.allc.org). The next international Digital 
Humanities conference (www.allc-ach2006.colloques.paris-sorbonne.fr)
will be held at the Sorbonne
in Paris in July, where computer scientists will be able
to draw inspiration from and suggest solutions to 
problems in humanities computing. 

This work is sure to yield valuable results in infor- 
mation retrieval, computer graphics, natural lan- 
guage processing, data mining, information 
visualization, and human-computer interfaces. The 
result will be that context is added to ubiquitous 
information access, making it both meaningful and 
full of meaning. 
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firms. 

e The ACM Fellows for 2005 are 
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Development Centre. 

e Previews of the International 


Collegiate Programming Contest and 
SIGCSE educational symposium. 


And much more! 
All online, in MemberNet: 
By RYEN W. WHITE, BILL KULES,
STEVEN M. DRUCKER, and
M.C. SCHRAEFEL, Guest Editors


SUPPORTING EXPLORATORY SEARCH


Online search has become an increasingly
important part of the everyday lives of 
most computer users. Search engines, 
bibliographic databases, and digital 
libraries provide adequate support for users 
whose information needs are well defined. 
However, there are research and develop- 
ment opportunities to improve current 
search interfaces so users can succeed more 
often in situations when: they lack the 
knowledge or contextual awareness to for- 
mulate queries or navigate complex informa- 
tion spaces, the search task requires browsing 
and exploration, or system indexing of available 
information is inadequate. For example, what if we 
want to find something from a domain where we have a general 
interest but not specific knowledge? How would we find classical music we 
might enjoy if we do not know what Beethoven or Berlioz sound like?
What about the difference between Baroque and Romantic? What do we
type into a search engine [2]?


ILLUSTRATION BY MARLENA ZUBAR
The answer is we usually submit a tentative query
and take things from there, exploring the retrieved 
information, selectively seeking and passively obtain- 
ing cues about where the next steps lie. Researchers 
from diverse communities, such as information 
retrieval, user interface design, information visualiza- 
tion, and library sciences have been working on tech- 
niques to support these kinds of queries in what is 
becoming known as “exploratory search.” 

This special section presents the thinking of 
informed scholars and leaders in the field based on 
their work in this area. We present a collection of dif- 
ferent perspectives, reflections, and future visions on 
exploratory search. (The section was inspired by dis- 
cussions at a workshop held at the University of 
Maryland in June 2005 [3]). The recurring themes 
throughout this section include: interactive search 
and browsing, user needs, user models, visualization, 
and dynamic workspaces. 


Designing interfaces to support
exploratory search presents unique demands over 
designing for searches where the target is well known 
or where a single document or fact will suffice. Sys- 
tems such as the mSpace Explorer, the Relation 
Browser, CitiViz, and the Phlat browser—all 
described in this section—try to make search more 
effective by providing a broader range of interface 
functionality and dynamically updating how search 
results are presented. Other options include the use of 
interfaces employing categorization or clustering (see 
the sidebar by Hearst for more details). 

Most of this research has been prototyped in 
restricted domains that are generally limited in size. 
However, emergent protocols, such as the Semantic 
Grid [1], are promising to enable data to be associated 
and reasoned over in new ways at Web scale. There is 
an imperative, therefore, to design interfaces that will 
take advantage of these new protocols, and that will 
thereby provide new mechanisms to support richly 
associated search across heterogeneous spaces. 

The provision of tools to support the exploration 
of such information spaces can yield great rewards for 
users, especially when contextual factors such as user 
emotion, task constraints, and dynamism of informa- 
tion needs are considered. Determining whether an 
exploratory search system is effective is a challenge in 
itself. No metrics exist to determine how well a system 
supports exploration, yet users will undoubtedly be 
able to tell what works well for them. Although some 
articles in this section mention evaluation, the issue is
worthy of more discussion than possible here; our pri- 
mary aim is to provide an overview of advances in 
techniques to support exploratory search. 

The development of new search tools requires 
novel research and collaborative efforts among com- 
puter scientists, social scientists, psychologists, library 
and information scientists, and practitioners who 
may lead the way with novel search applications on 
the Web. The authors featured here leverage their 
skills and experience to provide different perspectives, 
but with the same objective: to improve the search 
experience for computer users. For researchers, this 
section will shine a light on potentially new research 
areas; for practitioners, it will illustrate ongoing work 
and demonstrate why exploratory search is an area 
worthy of attention. 

Defining what constitutes an exploratory search is 
challenging. Indeed, almost all searches are in some 
way exploratory. As many of the examples in this sec- 
tion illustrate, an exploratory search may be charac- 
terized by the presence of some search technology and 
information objects that are inherently meaningful to 
users (for example, their images, email messages, and 
music files). Although there may be circumstances 
where exploratory strategies are used continually to 
allow people to discover new associations and kinds of 
knowledge, they are often motivated by a complex 
information problem, and a poor understanding of 
terminology and information space structure. 


In some respects, exploratory search can be
seen as a specialization of information exploration—a 
broader class of activities where new information is 
sought in a defined conceptual area; exploratory data 
analysis is another example of an information explo- 
ration activity. In exploratory search, users generally 
combine querying and browsing strategies to foster 
learning and investigation. Although the exploration 
of information to reduce uncertainty is addressed in 
many fields, we focus on three areas: information 
retrieval (how information is found), information 
studies (how needs are described and information is 
used), and information visualization (how informa- 
tion is presented). The articles do not discuss fields 
such as knowledge management and cognitive psy- 
chology. Although both fields contain relevant 
research, they are beyond the scope of this editorial 
project. 

Marchionini opens the discussion with an article
describing the difference between exploratory search
and lookup or retrieval search. He argues that 
exploratory search is based on learning and investiga- 
tion, and requires support for browsing strategies, 
whereas lookup or retrieval can be best supported by 
analytic strategies. Marchionini uses the phrase 
“Human-Computer Information Retrieval” to 
describe more active user involvement in the search 
process, and illustrates practices and trends with 
examples from user interaction that support retrieval 
and exploration strategies. 

Fox et al. describe their efforts to support explo- 
ration using the metaphor of stepping stones and 
pathways. Through tailored visualization they make 
searchers aware of the “big picture,” help them make 
new insights, follow alternative pathways, find helpful 
connections, and discover more relevant items. They 
demonstrate the importance of visualization interfaces 
as a natural and efficient way to support exploratory 
search through navigation and hypotheses generation. 

Gersh et al. focus on the role of exploratory search 
in the intelligence analysis process. They advocate the 
use of rich information collections to help informa- 
tion analysts understand, synthesize, and present a 
coherent explanation of world events. Rich informa- 
tion collections and a general concept for using such 
collections in analysis are defined. The authors illus- 
trate how such collections can benefit analysts as they 
explore information represented by complex graphs, 
such as social networks. 

Sidebars are interspersed throughout the section 
briefly describing research and techniques comple- 
mentary to the three feature articles in domains such 
as music, photography, and email. The sidebar by 
schraefel et al. describes mSpace, an interaction model 
and software framework that brings together a variety 
of mechanisms to improve information access by sup- 
porting multiple ways of exploring information. 
Shneiderman et al. describe their research on helping 
searchers find photographs through interface strate- 
gies to help them annotate, browse, and share infor- 
mation. Cutrell and Dumais describe their work with 
the Phlat browser for exploring personal information. 
Jansen describes research on providing automated 
assistance to users engaged in exploration when they 
are likely to be most receptive to it. Hearst compares 
and contrasts clustering and faceted categorization 
approaches to organize information retrieved from 
search engines and facilitate exploration to focus an 
initially vague query and improve retrieval. 
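
A minimal sketch of the faceted side of that comparison, in Python, might look as follows. It assumes each retrieved item already carries manually assigned facet values (here `period` and `form`, both hypothetical field names chosen for this example); clustering, by contrast, would have to infer its groupings from the items themselves. The records and helper functions are illustrative only and do not describe any system covered in this section.

```python
from collections import Counter


def facet_counts(items, facet):
    """Count how many items fall under each value of one facet."""
    return Counter(item[facet] for item in items if facet in item)


def narrow(items, **selected):
    """Keep only the items matching every selected facet value."""
    return [
        item for item in items
        if all(item.get(facet) == value for facet, value in selected.items())
    ]


# Hypothetical result set for a vague query such as "classical music".
results = [
    {"title": "Brandenburg Concerto No. 3", "composer": "Bach",
     "period": "Baroque", "form": "concerto"},
    {"title": "The Four Seasons", "composer": "Vivaldi",
     "period": "Baroque", "form": "concerto"},
    {"title": "Symphonie fantastique", "composer": "Berlioz",
     "period": "Romantic", "form": "symphony"},
    {"title": "Nocturne Op. 9 No. 2", "composer": "Chopin",
     "period": "Romantic", "form": "nocturne"},
]

# Show the searcher what is available before they know what to ask for.
period_summary = dict(facet_counts(results, "period"))
print(period_summary)

# Narrow the set by picking values from two facets at once.
romantic_symphonies = narrow(results, period="Romantic", form="symphony")
print([item["title"] for item in romantic_symphonies])
```

The counts give a searcher who cannot yet name a composer an overview to choose from, and each selection refines the initially vague query without requiring new query terms.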

The articles here exemplify the growing interest in 
supporting exploratory search and illustrate, through 
descriptions of research in multiple domains, the 
importance of the challenge. As search technologies
become more pervasive they will be used more intensively
for an increasingly diverse range of activities. It
is vital that, as user requirements evolve from using
search for lookup to using it to learn, investigate, and
explore, search support tools also evolve.

As our understanding of how users explore, and
what they need during this exploration, grows, new
modalities will emerge that use the context of search 
activity and contextual information available in the 
target documents, and beyond, to aid searcher under- 
standing, and reduce uncertainty about the nature of 
the problem and the information being searched. Sys- 
tems will support a diverse range of user search strate- 
gies such as recall and recognition, facet-based search 
and domain selection, and contain workspaces to sup- 
port a spectrum of activities, from unstructured note- 
taking to integrated authoring environments. 

Search is only a partially solved problem. The new 
directions we have proposed represent a subset of the 
grand challenges that await those interested in 
enhancing the search experience for users. Supporting 
exploratory search is an exciting multidisciplinary area 
that will have a profound effect on how information 
is gathered, used, and shared. Rather than just pro- 
viding search results, search systems should help users 
explore, overcome uncertainty, and learn. To accom- 
plish this, researchers and practitioners must leverage 
their skills and experience to develop search systems 
that actively engage searchers by using semantics, 
inherent structure, and meaningful categorization to 
organize intuitive visual workspaces.
By GARY MARCHIONINI


EXPLORATORY SEARCH: FROM FINDING TO UNDERSTANDING


Research tools critical for exploratory search success 
involve the creation of new interfaces that move the 
process beyond predictable fact retrieval. 


From the earliest days of computers, search has been a
fundamental application that has driven research and
development. For example, a paper published in the
inaugural year of the IBM Journal 36 years ago outlined
challenges of text retrieval that continue to the
present [4]. Today's data storage and retrieval
applications range from database systems that
manage the bulk of the world’s structured data
to Web search engines that provide access to
petabytes of text and multimedia data. As
computers have become consumer products and the
Internet has become a mass medium, searching the
Web has become a daily activity for everyone from
children to research scientists.
As people demand more of Web services, short
queries typed into search boxes are not robust enough 
to meet all of their demands. In studies of early hyper- 
text systems, we distinguished analytical search strate- 
gies that depend on a carefully planned series of 
queries posed with precise syntax from browsing 
strategies that depend on on-the-fly selections [7]. 
The Web has legitimized browsing strategies that 
depend on selection, navigation, and trial-and-error 
tactics, which in turn facilitate increasing expectations 
to use the Web as a source for learning and 
exploratory discovery. This overall trend toward more 
active engagement in the search process leads the
research and development community to combine work in
human-computer interaction (HCI) and information retrieval (IR).
This article distinguishes exploratory search that
blends querying and browsing strategies from
retrieval that is best served by analytical
strategies, and illustrates interactive IR practices
and trends with examples from two user interfaces
that support the full range of strategies.

Exploratory search. Search is a fundamental life 
activity. All organisms seek sustenance and propagation,
and Maslow’s classic hierarchy of needs theory
predicts that once people fulfill basic physiological
needs, they seek to fulfill social and psychological needs
to belong and to know our world. These higher-level 
needs are often informational and this in turn 
explains why information resources and communica- 
tion facilities are so sophisticated in developed soci- 
eties. 


A hierarchy of information needs may
also be defined that ranges from basic facts that guide 
short-term actions (for example, the predicted chance 
for rain today to decide whether to bring an umbrella) 
to networks of related concepts that help us under- 
stand phenomena or execute complex activities (for 
example, the relationships between bond prices and 
stock prices to manage a retirement portfolio) to com- 
plex networks of tacit and explicit knowledge that 
accretes as expertise over a lifetime (for example, the 
most promising paths of investigation for the sea-
soned scholar or designer). For these respective layers 
of information needs, we can define kinds of infor- 
mation-seeking activities, each with associated strate- 
gies and tactics that might be supported with 
computational tools. 

Figure 1 depicts three kinds of search activities that
we label lookup, learn, and investigate; and highlights
exploratory search as especially pertinent to the learn
and investigate activities.¹ These activities are represented
as overlapping clouds because people may
engage in multiple kinds of search in parallel, and 
some activities may be embedded in others; for example,
lookup activities are often embedded in learn
or investigate activities. The searcher views these
activities as tasks, so we use “task” in the following
discussion.

Lookup is the most basic kind of search task
and has been the focus of development for database
management systems and much of what Web search
engines support. Lookup tasks return discrete and
well-structured objects
files of text or other media. Database management 
systems support fast and accurate data lookups in 
business and industry; in journalism, lookups are 
related to questions of who, when, and where as 
opposed to what, how, and why questions. In 
libraries, lookups have been called “known item” 
searches to distinguish them from subject or topical 
searches. 

Most people think of lookup searches as “fact 
retrieval” or “question answering.” In general, lookup 
tasks are suited to analytical search strategies that 
begin with carefully specified queries and yield precise 
results with minimal need for result set examination 
and item comparison. Clearly, lookup tasks have been 
among the most successful applications of computers 
and remain an active area of research and develop- 
ment. However, as the Web has become the informa- 
tion resource of first choice for information seekers, 
people expect it to serve other kinds of information 
needs and search engines must strive to provide ser- 
vices beyond lookup. 


Figure 1. Search activities. 


¹There are many important theoretical models of information search; for example,
Saracevic summarizes Belkin’s and Ingwersen’s in his stratified model [9].


Searching to learn is increasingly viable as more pri- 
mary materials go online. Learning searches involve 
multiple iterations and return sets of objects that 
require cognitive processing and interpretation. These 
objects may be instantiated in various media (graphs,
maps, texts, or videos) and often require the information
seeker to spend time scanning/viewing, compar-
ing, and making qualitative judgments. Note that 
“learning” here is used in its general sense of develop- 
ing new knowledge and thus includes self-directed 
life-long learning and professional learning as well as 
the usual directed learning in schools. Using termi- 
nology from Bloom's taxonomy of educational objec- 
tives, searches that support learning aim to achieve: 
knowledge acquisition, comprehension of concepts or 
skills, interpretation of ideas, and comparisons or 
aggregations of data and concepts. 

Another important kind of search that falls under 
the learn search activity is social searching where peo- 
ple aim to find communities of interest or discover 
new friends in social network systems (for example, 
www.friendster.com). Although the motivations may 
be distinct from other learning search examples, the 
exploratory strategies for locating, comparing, and 
assessing results are similar. Much of the search time 
in learning search tasks is devoted to examining and 
comparing results and reformulating queries to dis- 
cover the boundaries of meaning for key concepts. 
Learning search tasks are best suited to combinations 
of browsing and analytical strategies, with lookup 
searches embedded to get one into the correct neigh- 
borhood for exploratory browsing. 


Searches that support investigation involve
multiple iterations that take place over perhaps very 
long periods of time and may return results that are 
critically assessed before being integrated into per- 
sonal and professional knowledge bases. Investigative 
searches aim to achieve Bloom’s highest-level objec- 
tives such as analysis, synthesis, and evaluation and 
require substantial extant knowledge. Such searches 
often include explicit evaluative annotation that also 
becomes part of the search results. Investigative 
searching may be done to support planning and fore- 
casting, or to transform existing data into new data or 
knowledge. In addition to finding new information, 
investigative searches may seek to discover gaps in 
knowledge (for example, “negative search” [1]) so that 
new research can begin or dead-end alleys can be 
avoided. Investigative searches also include alerting
service profiles that are periodically and automatically
executed.

Serendipitous browsing that is done to stimulate 
analogical thinking is another kind of investigative 
search. Investigative searching is more concerned 
with recall (maximizing the number of possibly rele- 
vant objects that are retrieved) than precision (mini- 
mizing the number of possibly irrelevant objects that 
are retrieved) and thus not well supported by today’s 
Web search engines that are highly tuned toward pre- 
cision in the first page of results. This explains why so 
many specialized search services are emerging to aug- 
ment general search engines. Because experts typically 
know which information resources to use, they can 
formulate precise analytical queries but require 
sophisticated browsing services that also provide 
annotation and result manipulation tools. 
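
The recall and precision trade-off described above can be made concrete with a small calculation. The following is a minimal sketch using the standard definitions (precision as the fraction of retrieved objects that are relevant, recall as the fraction of relevant objects that are retrieved); the document identifiers and both result lists are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one result list against known relevant items."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical document identifiers, invented for illustration.
relevant_docs = {"d1", "d2", "d3", "d4"}
first_page = ["d1", "d2", "d9"]                                    # short, precision-tuned list
broad_sweep = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]     # longer, recall-oriented list

print(precision_recall(first_page, relevant_docs))
print(precision_recall(broad_sweep, relevant_docs))
```

In this toy case the short list scores higher on precision while the long list reaches full recall, which is the balance an investigative searcher is usually willing to accept.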

These distinctions among different types of search 
activities suggest that lookup searches lend themselves 
to formalized turn-taking where the information 
seeker poses a query and the system does the retrieval 
and returns results. Thus, the human and system take 
turns in retrieving the best result. However, learning 
and investigative searching require strong human par- 
ticipation in a more continuous and exploratory 
process. 

To support the full range of search activities, the IR 
community is turning increasingly to CHI develop- 
ments to discover ways to bring humans more actively 
into the search process. Rather than viewing the 
search problem as matching queries and documents 
for the purpose of ranking, interactive IR views the 
search problem from the vantage of an active human 
with information needs, information skills, powerful 
digital library resources situated in global and locally 
connected communities—all of which evolve over 
time. The digital library resources are assumed to 
include dynamic contents such as other humans, sen- 
sors, and computational tools. In this view, the search 
system designer aims to bring people more directly 
into the search process through highly interactive user 
interfaces that continuously engage human control 
over the information seeking process. Although this is 
an ambitious design goal, we are beginning to see 
some progress in systems that are the forerunners to 
the exploratory search engines that will evolve in the
years ahead.


TOWARD EXPLORATORY SEARCH SYSTEMS 

Menus in restaurants serve the needs of both man- 
agement and diners. From the system point of view, 
menus scope the kinds of products and services 
available and thus optimize performance; and from 
the patron’s point of view they simplify selection and 
specification of gastronomical needs. In the computer
industry, menus were the first kind of alternative
to command systems and remain an important
interaction style for selection and browsing.
Expandable hierarchical file structures are special- 
ized menus that serve as the mainstay of personal 
computing, cell phone, and PDA 
interfaces. 

Hypertext links in texts were called “embedded menus” by
Shneiderman [10] and current Web directory structures (for
example, Open Directory) represent sophisticated menu
structures for finding information on Web pages. In the
database realm, query-by-example (QBE) interfaces were early
alternatives to formal language interfaces and QBE-like
systems remain the primary method for supporting non-textual
queries in multimedia systems. These interface design
experiences demonstrate the efficacy of selection as a form
of query specification, and inspire link navigation as a
primary user interface interaction style in the Web
environment.

Figure 2. Open Video preview display for a video.

There is also substantial evidence in the IR literature that
relevance feedback (asking information seekers to make
relevance judgments about returned objects and then executing
a revised query based on those judgments) is a powerful way
to improve retrieval. However, practice shows that people are
often unwilling to take the added step to provide feedback
when the search paradigm is the classic turn-taking model. To
engage people more fully in the search process and put them
in continuous control, researchers are devising highly
interactive user interfaces. Shneiderman and his colleagues
created “dynamic query” interfaces [10] that use mouse actions
such as slider adjustments and brushing techniques to pose
queries and client-side processing to immediately update
displays to engage information seekers in the search process.
A number of prototypes (for example, Dynamic Home Finder,
SpotFire, TreeMaps) have come from these lines of research
and development. These techniques are especially good for
exploration where high-level overviews of a collection and
rapid previews of objects help people to understand data
structures and infer relationships among concepts.
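
To illustrate the dynamic query idea, the sketch below re-runs a client-side filter every time a slider range changes, so the visible set can be redrawn immediately. It is only an illustration of the interaction pattern, not the implementation of any of the prototypes named above; the `Home` record, the listings, and the `dynamic_query` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Home:
    address: str
    price: int       # asking price in dollars
    bedrooms: int


# Hypothetical listings, invented for illustration.
HOMES = [
    Home("12 Oak St", 180_000, 2),
    Home("48 Elm Ave", 250_000, 3),
    Home("7 Pine Rd", 320_000, 4),
    Home("93 Birch Ln", 410_000, 3),
]


def dynamic_query(homes, sliders):
    """Return the homes whose attributes lie inside every slider's (low, high) range."""
    return [
        home
        for home in homes
        if all(low <= getattr(home, attr) <= high for attr, (low, high) in sliders.items())
    ]


# Each slider movement re-runs the filter locally, so the display can update at once.
sliders = {"price": (0, 500_000), "bedrooms": (1, 5)}
for attr, new_range in [("price", (200_000, 350_000)), ("bedrooms", (3, 3))]:
    sliders[attr] = new_range
    visible = dynamic_query(HOMES, sliders)
    print(attr, new_range, [home.address for home in visible])
```

Because the whole data set sits in memory on the client, no round trip to a server is needed between slider movements, which is what keeps the searcher in continuous control.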

Other researchers have investigated these highly
interactive styles. Hearst and her colleagues
created a series of interfaces that tightly couple
queries to results, ranging from TileBars for text
searching [2] to Flamenco (see the sidebar in this section),
a series of interfaces that provides hierarchical,
faceted metadata as entry points for exploration and
selection. Hearst and Pedersen [3], and others (for
example, [11]) have used clustering of search results 
to make search more interactive, as represented by 
current Web search alternatives such as Clusty 
(clusty.com) that aim to provide groups of results that 
can be used to further search. Fox et al., schraefel et 
al., and Cutrell and Dumais offer other examples in 


Exploratory search makes us all 

pioneers and adventurers in a new world 
of information riches awaiting discovery 
along with new pitfalls and costs. 
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exploratory search. Our work at the University of 
North Carolina parallels these efforts and two exam- 
ple search systems that support exploratory search are 
illustrated here. 


OPEN VIDEO EXAMPLES 
The Open Video Digital Library (www.open-video.org)
aims to give people agile views of digital
video files [6]. The Web-based interface provides a 
number of alternative ways to slice and dice the video 
corpus so that people can see what is in the collection 
(overview) and determine greater details about a 
video segment (preview) before downloading it. 
There are different kinds of surrogates provided, 
including textual and visual representations and sev- 
eral layers of detail and alternative display options to 
give people good control. The user interface was 
designed to optimize agile exploration before down- 
loading while allowing standard text-based search. 
A number of user studies were conducted to deter- 
mine which surrogates are effective and what parameters 
to use as defaults. This interface has proven to be quite 
effective over the past few years as thousands of users 
access the corpus each month to find videos for educa- 
tional and research purposes. The home page provides a 
typical search form but also partitions the video collec- 
tion in various ways so that people can select a specific 
partition to explore. Result set pages provide alternatives
for what is displayed (formats and level of text and visual
detail) and how the results are ordered (relevance, title,
duration, date, popularity).

Figure 3. (a) Relation browser interface for Open Video
Library with mouse over the education facet; (b) Relation
browser display after educational and Spanish selected,
mouse over fourth title.

Figure 2 shows a preview for a 
video with textual metadata and up 
to three kinds of visual surrogate 
(storyboard, fast forward, excerpt). 
The searcher may get more details 
by selecting the visual surrogate or 
download a video file in a format of 
their choice. The Open Video search 
system is meant to put people in 
control and support exploration as 
well as lookup. Our transaction logs 
indicate that half of the searches 
conducted begin with keyword 
strategies (analytical strategies) and 
the remainder begin with partition 
selection (browsing strategies). 


As part of our efforts to develop highly
interactive UIs that support exploratory search for 
government statistical Web sites, we developed a gen- 
eral-purpose interface called the Relation Browser 
(RB) that can be applied to a variety of data sets [5]. 
The RB aims to facilitate exploration of the relation- 
ships between (among) different data facets, display 
alternative partitions of the database with mouse 
actions, and serve as an alternative to existing search 
and navigation tools. RB provides searchers with a
small number of facets such as topic, time, space, or
data format, each of which is limited to a small number
of attributes that will fit on the screen; simple
mouse-brushing capabilities to explore relationships
among the facets and attributes; and immediate
results displays that dynamically change as brushing
continues. Figure 3 illustrates how the RB works for a
database such as the Open Video DL. Panel 3a depicts 
a portion of the RB at startup with the mouse posi- 
tioned over the Educational category in the genre 
facet. The number of videos in the library in each of 
the facet-categories is immediately shown along with 
a set of bars that show the distribution visually. Thus, 
simply moving the mouse partitions the full corpus 
into a view of the educational items. Clicking the
mouse freezes this partition and allows continued 
browsing or retrieval of the partition from the server. 
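
The brushing behavior described above amounts to recomputing per-category counts for whatever partition the mouse currently selects. The following is a minimal sketch of that idea, not the RB's actual code; the records, the facet names, and the helper functions are hypothetical, and the real Open Video metadata is richer.

```python
from collections import Counter

# Hypothetical metadata records, invented for illustration.
VIDEOS = [
    {"title": "Intro to Volcanoes", "genre": "Educational", "language": "English"},
    {"title": "El Ciclo del Agua", "genre": "Educational", "language": "Spanish"},
    {"title": "Apollo Footage", "genre": "Documentary", "language": "English"},
    {"title": "Las Plantas", "genre": "Educational", "language": "Spanish"},
    {"title": "Harbor Newsreel", "genre": "Historical", "language": "English"},
]
FACETS = ("genre", "language")


def partition(videos, selection):
    """Return the videos matching every facet=category pair in the selection."""
    return [v for v in videos if all(v[f] == c for f, c in selection.items())]


def facet_counts(videos, facets=FACETS):
    """Count videos per category in each facet (the numbers behind the bars)."""
    return {facet: Counter(v[facet] for v in videos) for facet in facets}


print(facet_counts(VIDEOS))                        # startup view of the full corpus
hover = partition(VIDEOS, {"genre": "Educational"})
print(facet_counts(hover))                         # counts while brushing "Educational"
frozen = partition(VIDEOS, {"genre": "Educational", "language": "Spanish"})
print([v["title"] for v in frozen])                # partition after a second selection
```

Keeping all metadata in memory is what makes the counts update on every mouse movement, and it is also why an approach like this stops scaling once the collection no longer fits on the client.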

Panel 3b shows a portion of the display after the 
user has selected the Spanish language category 
within the educational partition and then clicked on 
the Search button. The display shows the number of 
items in each facet-category for the 41 videos in the 
result set in the upper panel and the titles, keywords, 
and producing agent for the videos in the bottom 
panel with additional metadata available on mouse- 
over. These items are hot linked to the Open Video 
DL. String search within the results fields is also sup- 
ported and all results panel and query panel displays 
are coordinated to update in parallel when any mouse 
or keyboard action is executed. 

The RB has been instantiated for dozens of data- 
bases, including several U.S. federal statistical agency 
Web sites. RB was designed to facilitate exploration 
and is less direct for simple lookup tasks than for 
exploratory tasks. Our user studies have demonstrated 
its efficacy when compared to standard Web-based 
retrieval. To support the dynamics, the metadata and 
query results must be available on the client side, thus 
limiting scalability to databases of roughly tens of 
thousands of items. We see this specialized kind of 
interface as an augmentation of today’s powerful 
lookup engines. The RB could be used as a tool for 
exploring very large databases where the results are not 
individual items but subcollections or portals. Alterna- 
tively, the RB may be used after a standard Web search 
has been executed to investigate the result set if on-the- 
fly automatic classification is used. 


CONCLUSION 

It is clear that better tools to support exploratory 
searching are needed. Oblinger and Oblinger [8] 
argue the “Net generation” (those who learned to 
read after the Web) are qualitatively different in 
their informational behaviors and expectations; they 
multitask and expect their informational resources 
to be electronic and dynamic. The Net generation 
will expect to be able to use Web resources to con- 
duct lookup, learning, and investigative tasks with 
fluid user interfaces. 

As people spend more time online, not only will 
they increase their expectations about information 
tools and content, but there are more opportunities 
for mining their behavior patterns and applying 
adversarial computing that tries to take advantage of 
system and user behaviors. Exploratory search makes 
us all pioneers and adventurers in a new world of 
information riches awaiting discovery along with new
pitfalls and costs.

Today, executing a query in a Web search engine
not only returns results but targets the searcher for 
many kinds of presumably related opportunities and 
services. Exploratory search will exacerbate this trend 
as more user interaction data will be available for min- 
ing and analysis. One implication of considering 
good Web design that supports exploratory search 
together with client-side applications, like the RB, is 
to provide people with ways to trade off personal 
behavior data for added value services. Those who do
not want their information behaviors to be mined can 
choose to use more client-side exploration tools, only 
sending requests for database partitions to the server. 

Regardless of where the exploration takes place, it 
is clear that more computational resources will be 
devoted to exploratory search and the next search 
engine behemoths will be the ones that provide easy-to-apply
exploratory search tools that help information
seekers get beyond finding to understanding and
use of information resources.
mSpace: Improving Information Access to Multimedia Domains
with MultiModal Exploratory Search


By M.C. SCHRAEFEL, MAX WILSON, ALISTAIR RUSSELL, and DANIEL A. SMITH 


If you do not know much about
classical music, how do you discover what 
you might like? The first port of call for most 
people is to Google “classical music.” This 
returns a list of links to sites that provide tex- 
tual descriptions of terms or times in the clas- 
sical music domain, or sometimes sites that 
sell classical music, or review particular 
pieces, or feature beginner’s guides that list 
“must hear” works [1]. While these are all 
useful sources of information, they do not 
enable direct assessment by or access to the 
content of the domain itself. mSpace 
(mspace.fm) is an interaction model [2] and 
software framework [6] that brings together a 
variety of mechanisms to improve access to 
information by supporting multiple ways of 
exploring the information itself. 

The mSpace model conceptualizes infor- 
mation as a set of dimensions within a 
domain. In order to manage these high- 
dimensional spaces, we represent only a subset 
at a time, called a “slice.” These are arranged 
from left to right, in columns, creating a hier- 
archy, where the left-most column is the top 
level of the hierarchy and the right-most is at 
the bottom. Instances associated with a
dimension are populated into a column. The
accompanying figure shows that selecting
Baroque in the Era column has restricted the
instances that appear in the Composer column
to those composers of the Baroque era.

The slice is dynamic: it can be altered by 
rearranging, adding, or subtracting dimen- 
sions, enabling individuals to determine how 
they wish to organize the domain to support 
their interest. For instance, someone interested 
in piano pieces can replace the Era dimension 
(default first column) with Arrangement. 
Selecting Piano will then mean that the follow- 
ing column—Composers—will show only 
those composers who created works for Piano. 
Providing these kinds of manipulations is what 
Marchionini refers to as support for “slicing 
and dicing” [3] information for user-deter- 
mined exploration. 
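
The slice behavior described above can be sketched as a left-to-right pass over the chosen dimensions, where each selection narrows what the columns to its right may show. This is a rough illustration under stated assumptions, not the mSpace framework's code: the `columns` function and the simplified records are hypothetical.

```python
# Hypothetical, simplified records for a classical music domain.
PIECES = [
    {"era": "Baroque", "composer": "Bach", "arrangement": "Organ"},
    {"era": "Baroque", "composer": "Vivaldi", "arrangement": "Violin"},
    {"era": "Classical", "composer": "Mozart", "arrangement": "Piano"},
    {"era": "Classical", "composer": "Beethoven", "arrangement": "Piano"},
    {"era": "Romantic", "composer": "Chopin", "arrangement": "Piano"},
]


def columns(records, slice_order, selections):
    """Build one column per dimension; a selection restricts the columns to its right."""
    remaining = list(records)
    result = []
    for dimension in slice_order:
        values = sorted({record[dimension] for record in remaining})
        result.append((dimension, values))
        if dimension in selections:
            remaining = [r for r in remaining if r[dimension] == selections[dimension]]
    return result


# Default slice: selecting Baroque in Era narrows the Composer column.
print(columns(PIECES, ["era", "composer"], {"era": "Baroque"}))

# Rearranged slice: Arrangement replaces Era as the first column.
print(columns(PIECES, ["arrangement", "composer"], {"arrangement": "Piano"}))
```

Rearranging the slice is just a change to `slice_order`, so the same records can be explored by era first or by arrangement first without restructuring the data.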

mSpace represents slices within a multi- 
column “faceted” (see Hearst in this section) 
spatial browser that presents persistent con- 
textual information around items of 
interest (see the figure). This approach 
facilitates “information triage” [4]. Peo- 
ple can quickly browse the domain, see 
information about a selected item, see 
which other items are in the same area, and 
easily add the selected item to a collection 
space for later attention. 

The terms in a given dimension (Baroque, 
symphony, serenade), nicely organized though 
they are, may mean nothing to a neophyte explorer.
mSpace provides preview cues [5] to help address this
problem. A preview cue is a lightweight mechanism
that provides a multimedia preview of the area
currently being triaged in the interface. In the classical
music example, hovering over cues associated with
Baroque will trigger pieces that are from that period.
In this way, preview cues are distinct from current
music store software, which provides audio samples only
with a given piece. As well, preview cues foreground the
exploration of content through examples of the content
itself (in this case, the music) rather than the
information about it, like Composer, Period, and so on.

The mSpace Explorer. A slice through the information space
is shown with four dimensions: [...] from the right of
composer to the left, rearranging the slice.

[...] category rather than the instance level
alone, people are significantly better able to explore a
domain and make selections within it [5]. At the system
level, there are many methods to select the cues
to associate with a category: expert recommendations,
collaborative filtering, feature detection, and even
random selection. Our own research suggests that
even if the system uses only simple random selection
to select a preview cue within the set of available
pieces for that category it is both highly effective and
scalable. In other words, any valid preview cue is
significantly better than no cue at all. To date, we have
explored the use of preview cues for music and video
(audio cues and video cues, respectively). One of the
advantages of preview cues is they provide an additional
dimension for exploration with low impact for
screen real estate requirements. This low screen usage
for high value assessment has been particularly helpful
when porting the mSpace interface to mobile
devices (www.mspace.fm/mobile), where screen real
estate is highly constrained.

The mSpace framework itself is built on Semantic
Web technologies. The Semantic Web enables diverse,
heterogeneous (that is, Web-type) resources to be
reused in new contexts. Thus, an mSpace can pull
together a variety of resources into new contexts by
leveraging their associated semantics. Increasing
resources, such as the Open Guides, blogs, and
news RSS streams, publish their data as RDF. Other
open sources such as Wikipedia can be easily translated
into RDF, the W3C standard for mark-up language for
semantic description of resources. With Semantic Web tools
like mSpace, a source such as Wikipedia, when presenting
information on Beethoven, can use either an
mSpace about historical figures or a classical music
mSpace featuring composers. These semantics also
suggest that multiple mSpaces can be interconnected,
to enable moving between domains as readily as moving
between dimensions within a domain. The heterogeneity
of domains will, in the near future, enable
people to choose information sources they trust to
populate the interface.

We also are integrating lightweight
annotation/publishing methods to enable people to 
make comments on their discoveries and publish 
these comments so they too can add to information 
that can be explored about attributes of a domain. In 
the architecture, we are also investigating models from 
peer-to-peer to trackback for automatic publishing to 
an mSpace about relevant resources or for it to dis- 
cover appropriate resources. 

The purpose of mSpace is to improve access to 
information by supporting exploration of domains. 
mSpace does this by providing a variety of mecha- 
nisms—slice rearrangement, preview cues, collection 
building, information views, and contextual spatial 
displays—to let people determine how they wish to 
organize the space to best support their focus of inter- 
est. The long game plan for mSpace is to evolve into 
one of those interfaces that can both help discover 
these resources, bring them together dynamically, and 
enable a rich palette of mechanisms to explore and 
share the discoveries made on such expeditions. 
EXPLORING PERSONAL INFORMATION


By EDWARD CUTRELL and SUSAN T. DUMAIS 


When we look for information
in large or unfamiliar sources such as
the Web or an encyclopedia, it is almost second 
nature to use a search engine to help us find 
what we need. Yet until recently, people have 
had to be more creative when trying to find 
information on their own computers; locating 
an email message, a bank statement, or a short 
movie clip of a friend on your own computer 
can be an exercise in patience and luck. While 
research in the area of personal information 
access has been ongoing for several years [1, 3], 
we are only now beginning to see widespread 
use of such tools. Rich search and browsing 
capabilities to support exploration are now 
being built into the next generation of PC oper- 
ating systems (for example, Apple’s Spotlight for 
Tiger OS X and Microsoft’s Vista Search) and 
are also available in a variety of standalone desk- 
top search tools. 

One might wonder why people would need 
to “explore” their own information. An impor- 
tant reason is that it is difficult for users to 
unambiguously specify what they are looking for,
even in their own collections (as described by 
Marchionini in this section). To make matters 
worse, human memory is often vague and 
dependent on context. Therefore, users must be 
presented with a wide variety of techniques to 
articulate and refine information needs (for 
example, keywords, document summaries, or 
metadata about document content and use). 


Since personal collections are often stored 
locally, highly dynamic interfaces can be used to 
help users quickly iterate queries and explore 
their content. 

Designing interfaces for accessing personal 
information presents several unique challenges 
and opportunities. An important feature of per- 
sonal collections is that people are familiar with 
many details and characteristics about their 
information, as well as the contexts surrounding 
their use of it. When looking for personal infor- 
mation, you may remember the general topic of 
the item, who it was from, where you filed it, or 
roughly when you saw it. The challenge is in cre- 
ating a user interface that exploits the wide and 
varied associations and contextual cues that peo- 
ple remember about their information, while 
maintaining the simplicity of keyword search 
that makes Web search so powerful and easy. 
Such an interface must support search as well as 
browsing among many different kinds of meta- 
data. Metadata can serve as a query (for exam- 
ple, “find me email that I saw yesterday”), or as 
a cue allowing a user to recognize an item more 
easily. In many ways this is similar to the cate- 
gory interfaces described by Hearst in this sec- 
tion, providing an organizing context for results 
and future queries. 
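
The idea that keywords and metadata can act as interchangeable query terms can be sketched as a single filter that accepts both. The example below is a hedged illustration only and does not reflect how any particular desktop search tool is built; the `Item` record, the sample items, and the `search` function are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Item:
    kind: str        # for example "email", "photo", "document"
    sender: str
    seen: date       # when the user last looked at the item
    text: str


# Hypothetical personal items, invented for illustration.
ITEMS = [
    Item("email", "Ana", date(2006, 3, 1), "Photo from the lake trip attached"),
    Item("photo", "Ana", date(2006, 3, 1), "lake_trip_014.jpg"),
    Item("email", "Raj", date(2006, 3, 2), "Quarterly budget statement"),
    Item("document", "Raj", date(2006, 2, 20), "White paper on lake ecology"),
]


def search(items, keywords=(), **metadata):
    """Return items containing every keyword and matching every metadata filter."""
    results = []
    for item in items:
        text = item.text.lower()
        if all(word.lower() in text for word in keywords) and all(
            getattr(item, field) == value for field, value in metadata.items()
        ):
            results.append(item)
    return results


broad = search(ITEMS, keywords=["lake"])                      # start from a vague memory
narrowed = search(ITEMS, keywords=["lake"], sender="Ana")     # add a remembered person
by_day = search(ITEMS, kind="email", seen=date(2006, 3, 2))   # metadata alone as the query
print(len(broad), len(narrowed), [item.text for item in by_day])
```

Treating both kinds of term uniformly lets a searcher begin with whichever cue comes to mind and refine from there, rather than choosing up front between a search box and a folder tree.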

Personal information has several other char- 
acteristics that are important in designing effec- 
tive interfaces. Personal information cuts across 
the many “silos” of information that exist today. 
For example, the address of a business contact 
may be in your address book, an email message, 
a white paper document, or even in the browser 
history. Interfaces that focus on only a single 
domain, such as email, documents, music,
photos, or real estate, are able to leverage the
uniform nature of metadata to tailor the interface
for that domain (for example, see Shneiderman
et al. in this section). This is much
more challenging when designing for a variety
of information sources. While thumbnail
images can be critical for finding photos, they
are almost useless for email; and the author
may be important for manuscripts, but does
not even exist for Instant Messaging
conversations. Some kinds of metadata such as
people and time are particularly useful for
retrieving your own content [1].
Finally, metadata comes from many
different sources: some are inherent properties of
the objects (file type), some are activity based
(how many times you've looked at it), and others
are explicitly added by users (tags or folders).
Tags are particularly interesting because they are
explicitly added to aid future use of the content [2].

In an effort to explore this UI challenge, we
created the Phlat interface for personal search
(see the figure). Phlat combines keyword search
and metadata browsing in a seamless manner,
allowing people to quickly and flexibly find
information based on whatever they may
remember about the information they are
looking for. In addition, Phlat provides a facility
for tagging items with a uniform system of
user-created metadata. Such tagging enables
people to add information they think will be
useful in getting back to their content.

A key to the design of Phlat is the tight coupling
of searching and browsing. Rather than
viewing search and browse as separate behaviors,
Phlat treats them as two ends of a smooth
continuum of exploratory search. To reinforce
this unification, keyword and metadata search
terms “look” very similar and are located in the
same query box (in the upper left of the figure).
A searcher looking for a photo may remember
the name of the person who sent it to her, the
approximate date it was taken, or perhaps some
text in an email message about the photo. From
any broad starting point, she may then rapidly
filter, sort, and iterate on her query based on
what she sees and remembers until she finds it.
Phlat allows for the fluid exploration of a person’s
own content using any of these cues.

We have deployed Phlat to about 500 volunteers
at Microsoft and are currently studying
how people use it. As the guest editors of this
section note, evaluation of such systems can be
very challenging. By studying detailed interaction
logs and [...] how people want to
interact with rich search systems for their own
content. Terabytes of personal storage will be
commonplace in a few years. With systems like
Phlat, we hope to make it as easy for people to
find, explore, and share their own information
as it is for them to acquire it in the first place.
EXPLORING THE COMPUTING LITERATURE
WITH VISUALIZATION AND STEPPING STONES & PATHWAYS


By EDWARD A. FOX, FERNANDO DAS NEVES,
XIAOYAN YU, RAO SHEN, SEONHO KIM,
and WEIGUO FAN


An alternative interpretation of queries involves
splitting a query into parts that cover different
but connected aspects of the information needed.


Just as explorers need maps and compasses to aid their travel,
so too do computing professionals need effective aids for 
exploring the computing literature. A modest number of 
efforts have aimed to address these needs of our community. 
For example, the Networked Computer Science Technical Reference
Library (www.ncstrl.org) grew out of efforts in the
early 1990s to collect technical reports from academic 
and research departments. In the late 1990s, the Comput- 
ing Research Repository (arxiv.org/corr/) was developed as a 
supplement, allowing (co)authors of papers on computing to 
self-archive, that is, submit to the arXiv.org e-print archive. 
Yet while these services play useful roles, their focus is on col- 
lecting and archiving, rather than on supporting exploration. 
Today, large content collections in the computing field can be 
searched and accessed through sites like the ACM Digital Library 
(portal.acm.org/dl.cfm), the IEEE Computer Society
Digital Library (www.computer.org/portal/site/csdl/), 
and the Digital Bibliography & Library Project 
(dblp.uni-trier.de/). Their services are typical of modern
digital library systems. Additional services are provided
by CiteSeer (www.citeseer.ist.psu.edu), which
leverages powerful analysis methods to extract cita- 
tion data from online documents, and then allows 
citation counts and links to be considered, to broaden 
browsing and searching. 


In this article, we provide illustrations of how
richer support for exploratory search can be afforded 
to the computing community, building upon a long 
history of investigations at Virginia Tech (for exam- 
ple, [5]), most recently related to our work with the 
Computing and Information Technology Interactive 
Digital Educational Library (CITIDEL; 
www.citidel.org). We begin with a brief explanation 
of how clustering can aid exploration (also see 
Hearst’s article in this section). Then, we describe 
how a combination of methods facilitates exploration 
that can involve rapid switching between searching, 
browsing in the ACM Computing Classification Sys- 
tem, following links between works, and selecting 
points in a scatter-plot (2D grid) according to any of 
a number of pairs of facets. Finally, we explain how 
Stepping Stones & Pathways (SSP) integrates visual- 
ization, clustering, and Bayesian inference to support 
exploration and the resolution of complex informa- 
tion needs that can be met by sets of related docu- 
ments. Preliminary results suggest SSP also may 
facilitate searching when faced with difficult queries, 
for which no current information retrieval system 
yields high effectiveness. 


CLUSTERING RESULTS 

CITIDEL has approximately a half-million meta- 
data records, with more to be added. It was devel- 
oped to support teaching and learning in the 
computing field, as a part of the National Science 
Digital Library (www.nsdl.org). Because education- 
oriented systems should support exploration, we 
turned to clustering integrated with visualization to 
help in that regard [6]. A basic problem is that for 
people unfamiliar with a large collection of litera- 
ture, such as occurs in the computing field, the 
results of a search are too voluminous to manage. 
When the results are organized well, and a smaller 
number of groups can be examined instead of a very 
long list of disparate works, users can save time 
when they explore. Our usability studies have shown 
that even such a simple approach can be beneficial 
[6]. Figure 1 illustrates the clusters identified by 
applying the Carrot2 system [7] to the results of 
searching CITIDEL for “digital library.” 

Selecting an interesting cluster, for example, 
“Library of Congress,” in the expandable list on the 
left of Figure 1 causes all of the four documents 
belonging to that cluster to be displayed on the right; 
one can explore among clusters, and then among the 
documents in a cluster. 
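
The idea of turning a long result list into a few labeled groups can be illustrated with a minimal sketch. It assumes each result is just a title string, and it is not the Carrot2 algorithm: it simply groups titles under phrases that several of them share. The function names, the stop-word list, and the sample titles are all hypothetical.

```python
import re
from collections import defaultdict

# Small illustrative stop-word list (an assumption, not from any system).
STOP = {"a", "an", "and", "for", "from", "in", "of", "on", "the", "to"}

def phrases(text, max_len=3):
    """Return 2- and 3-word phrases that neither start nor end with a stop word."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    found = set()
    for n in range(2, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0] not in STOP and gram[-1] not in STOP:
                found.add(" ".join(gram))
    return found

def cluster_by_phrase(titles, query, min_size=2):
    """Group titles under shared phrases; a title may appear in several groups.

    Phrases that occur in the query itself are skipped, because every
    result would share them and they would not separate anything.
    """
    query_phrases = phrases(query)
    groups = defaultdict(list)
    for title in titles:
        for phrase in sorted(phrases(title) - query_phrases):
            groups[phrase].append(title)
    clusters = {p: docs for p, docs in groups.items() if len(docs) >= min_size}
    # Largest clusters first, ties broken alphabetically by label.
    return dict(sorted(clusters.items(), key=lambda kv: (-len(kv[1]), kv[0])))

results = [
    "Library of Congress digital library program",
    "Lessons from the Library of Congress",
    "Digital library interoperability",
    "Metadata harvesting for a digital library",
    "Metadata harvesting protocols",
]
clusters = cluster_by_phrase(results, "digital library")
for label, docs in clusters.items():
    print(label, len(docs))
```

Selecting a label then amounts to looking up `clusters[label]`, which mirrors the two-level exploration described above: first among clusters, then among the documents in one cluster.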


INTEGRATED VISUALIZATION 

Digital libraries often have a variety of types of 
information that can be leveraged in an integrated 
manner to provide even richer support. We devel- 
oped CitiViz as a front-end to CITIDEL, combin- 
ing searching, text mining, and information 
visualization [8]. As can be seen in Figure 2, it 
applies two major visualization techniques—a 
hyperbolic tree of a hierarchical classification system 
and a 2D scatter-plot graph. 

The visual interface provides overviews of retrieved 
results from CITIDEL, or a subcollection thereof, in 
this case from metadata provided from the ACM Dig- 
ital Library. Figure 2, in the upper left, shows the 
results for the query “object-oriented programming.” 

The retrieved results are classified according to the 
ACM Computing Classification System (1998 Ver- 
sion; www.acm.org/class/1998). A hierarchical view 
of that subset of the classification system which cov- 
ers the retrieved results is shown using a hyperbolic 
tree in the upper right of Figure 2. Users can explore 
through “focus + context” navigation that balances 
the amount of local detail and global context presented.
Each node represents a category in the system; 
each node has an attached bubble whose size reflects the 
number of documents to which that category is 
assigned. 
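
The bubble sizes can be thought of as per-category document counts. The following is a minimal sketch under two assumptions that do not come from CitiViz itself: categories are dotted codes in the style of the ACM classification, and a document assigned to a category also counts toward that category's ancestors. All names and sample data are hypothetical.

```python
from collections import Counter

def ancestors(code):
    """Yield a dotted category code and each of its ancestors: D.3.2, D.3, D."""
    parts = code.split(".")
    for depth in range(len(parts), 0, -1):
        yield ".".join(parts[:depth])

def bubble_counts(doc_categories):
    """Count documents per category node, rolling counts up to ancestors.

    A document is counted once per node, even when several of its
    categories fall under that node.
    """
    counts = Counter()
    for categories in doc_categories.values():
        nodes = {node for code in categories for node in ancestors(code)}
        counts.update(nodes)
    return counts

# Hypothetical retrieved documents and their assigned categories.
docs = {
    "doc1": ["D.1.5", "D.3.2"],
    "doc2": ["D.3.2"],
    "doc3": ["D.3.3", "D.2.2"],
}
counts = bubble_counts(docs)
for node in sorted(counts):
    print(node, counts[node])
```

Only nodes with at least one document appear in `counts`, which corresponds to showing just the subset of the classification that covers the retrieved results.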

If a user selects a bubble in the hyperbolic tree, the 
documents associated are plotted in the middle right, 
using a 2D graph, whose axes are user selectable, in 
similar fashion to the visualization built into our 
Envision system [5]. Here the facets of interest are 
date and rank, which is based on estimated relevance. 
However, instead of having dots for each retrieved 
document, or simple icons as was allowed with Envi- 
sion, a tower is used to represent a document. If a 
tower is selected, that tower is shown, somewhat 
larger, in the lower left, where it is clear which cate- 
gory is assigned to each level of the tower. Also, if that 
document is connected with another document 
through a citation link, that link is illustrated, so both 
documents in the pair are identified. Metadata details 
about the selected document are shown at the bottom 
right. 

Categories selected while browsing the hyperbolic 
tree also are shown in the left middle. If one of these is 
selected, a list appears below it, with a title for each of 
the works to which that category is assigned. We have 
multiple views of the search results, with varying levels 
of detail, and with several different types of contextu- 
alization. Those exploring the collection can combine 
searching, following citation links, navigating through 
category systems, organizing results according to a 
number of different facets of interest, and choosing 
varying levels of detail. Our experiments have shown 
this approach to be powerful and helpful [8]. 


STEPPING STONES & PATHWAYS 

Both of the approaches to aid exploration described 
here begin with the results of a search. However, people
exploring a collection may have trouble with 
searching, and may need some help to get started in 
that activity. Our SSP system aims to address such 
situations [2, 3]. It provides special support for 
those with complex [...] of, and navigation [...] 
connect sequences of related topics. 

SSP is more powerful than conventional search systems 
since it loosens a key assumption built into other 
search engines. Those systems assume that users want to 
find one or more individual documents and that each 
should satisfy the user’s information need. In real life, 
however, a complex information need may be satisfied 
only by a set of documents: 


• When preparing a meal from scratch, one might 
need separate recipes for different parts, for example,
the salad and the salad dressing; 

• When preparing a syllabus for an advanced 
course, one might want a series of related supplemental
readings that use common terminology 
and that flow well together; 

• When identifying a treatment for an unusual set 
of symptoms, one might need papers that relate 
symptoms to illnesses, as well as results of clinical 
trials of treatments for illnesses; and 

• When exploring a collection to help with the 
research for a report or paper, one might need a 
sequence of papers that each cover part of the 
path from problem to hypothesis to approach to 
design to implementation to solution. 


Figure 1. Carrot2 clustering results for “digital library.” 


Figure 2. Visualization of “object-oriented programming” 
results through CitiViz. 


Figure 3. An illustration of Stepping Stones & Pathways 
software: (a) the top portion is a simplification of a screen 
dump, with labels added, when searching the operating 
system collection with the pair of topics parallel and 
distributed. It shows three stepping stones: message 
passing, communication overhead, and shared memory. 
(b) Our simplified summary of another part of the results, 
for the first stepping stone. In the Venn diagram the left 
circle represents end stone 1, the right circle represents 
end stone 2, and the yellow circle at the top represents 
the first stepping stone, message passing. The table gives 
titles for the documents shown in the Venn diagram. 


Solving such complex needs is best 
accomplished by humans and computers working 
together. As users are aided in breaking down com- 
plex information needs into parts, as they see what 
terminology is used for each of the subtopics that 
emerge during the work on partial solutions, and as 
they see how subtopics are related, they learn more 
about the collection and area of interest, as well as 
gain facility in problem solving. Thus, SSP provides 
rich support for exploration and discovery. 

Our development of SSP was inspired by the early 
work of Swanson [9]. Originally, Literature-based 
Discovery [4] was used to find connections between 
“cures” for “problems” in the medical domain and to 
discover novel connections between seemingly unre- 
lated topics. New techniques may help us assess how 
informative a particular connection is in diverse set- 
tings [11]. Within the framework of Literature-based 
Discovery, SSP can support a closed discovery model, 
where the two topics to connect are given by the user, 
or an open discovery model, where the two topics are 
inferred from the query and the collection content. 


56 April 2006/Vol. 49, No. 4 COMMUNICATIONS OF THE ACM 

Thus, SSP can work in two modes. In the first 
mode, the user provides two subqueries instead of a 
single, complex, information need statement. In the 
second mode, the user provides a single statement, 
and the computer helps with “query splitting,” find- 
ing two subqueries. In either case, SSP can proceed 
with two subqueries, and then work to find the path- 
ways that connect those subqueries. Figure 3 illus- 
trates a simple case of SSP working with the two 
subqueries parallel and distributed, to support explo- 
ration in a focused, educator/learner-oriented, collec- 
tion of papers on operating systems. 

We use the analogy of crossing a shallow stream, 
where there are stones on either bank, and stepping 
stones in the stream, so that people can follow different 
pathways to cross. The two original subqueries are like 
the stones on the banks, that is, the end stones. SSP 
begins with those, retrieves documents for each sub- 
query, and then begins its analysis. In Figure 3b, the 
small circle on the left indicates the documents 
retrieved for end stone 1, while the large circle on the 
right indicates the documents retrieved for end stone 2. 

Sometimes there are documents retrieved by the 
system for both original subqueries. For example, 
Document 1 in Figure 3b would also be found by a 
more conventional search engine. Each such document 
fully satisfies the user’s information need. Since our 
collection is small and focused, in keeping with the 
trend toward focused crawlers, avoiding the distractions 
resulting from searching over the Web or in collections 
like INSPEC, we find few documents satisfying both 
subqueries. Fortunately, there often are documents 
retrieved for end stone 1 that are linked to, or can be 
connected through clustering or other means to, some 
document retrieved for end stone 2. In Figure 3b, we see 
that documents 2 and 3 are retrieved, respectively, for 
end stone 1 and end stone 2, and are related; thus, the 
pair of documents 2-3 may satisfy the user’s information 
need. The same applies for document pair 4-5. 

Looking again at Figure 3a, we see in the middle 
what we call stepping stones. A stepping stone has a 
label, and represents a group of documents, each of 
which can be described in part by that label. That 
group is related, by citation, or by a high similarity 
value, to other groups. There are documents about 
parallel that are related to documents about message 
passing. There also are documents about message 
passing that are related to distributed. Hence, message 
passing is a topic that is a stepping stone connecting 
the two end stones; we have learned through exploration, 
and the help of SSP, that this is a bridging topic for the 
subqueries. The pathway 4-5-6 in Figure 3b illustrates 
how that stepping stone effects a connection between 
the end stones. 

SSP allows exploration at the level of browsing 
through such networks, seeing what topics connect 
other topics. Additional insights can be developed 
about the collection by requesting SSP to expand a 
pathway. For example, one could request an expansion 
to find topics that link end stone 1, parallel, to stepping 
stone 1, message passing. Such exploration can continue 
as long as there are enough documents in the group and 
sufficient connectivity among the documents. 

Originally, we tested SSP using a collection of documents
about operating systems. To ensure its generality,
we also tested it with education-oriented 
collections about two other topics: data mining and 
information retrieval. Table 1 gives examples of the 
operation of SSP with each of these three collections, 
each prepared in connection with our work on 
CITIDEL, and using CiteSeer [...] data. For the sake of 
brevity, we show only the end stones and a single 
stepping stone. In the third column we give titles of 
documents that provide support. For the operating 
system collection, we see a pair of documents that 
allow us to address an information need that involves 
both message passing and remote procedure calls. In 
particular, Implementing Object-based Distributed 
Shared Memory relates to two topics: message passing 
(the first end stone) and area networks (our stepping 
stone). Also, A Causally Consistent Protocol for 
Distributed Shared Memory relates to end stone 2, but 
also is related to documents about area networks. The 
two documents together reflect a pathway between the 
end stones, and together provide an answer to the 
user’s information need. 


Table 1. Examples for operating system, data mining, 
and information retrieval collections. 


Eventually, users go beyond exploring topics and 
want concrete information, that is, they want to see 
the documents that support a particular stone, and 
sequences of documents that correspond to a pathway.
SSP provides a variety of presentations of the 
stepping stones, pathways, and documents that relate. 
Often one can find a single document (see the bottom 
part of Table 1), pairs of documents (see the top and 
middle of Table 1), or even triples of documents that 
can support an information need. 
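
The notions of end stones, related document pairs, and stepping stones described above can be made concrete with a minimal sketch. It assumes the two end-stone result sets, the labeled topic groups, and the links between documents are already available; it uses only explicit links, whereas SSP also draws on clustering, similarity values, and Bayesian inference. All identifiers and the sample data are hypothetical.

```python
def related(graph, a, b):
    """True if documents a and b are linked in either direction."""
    return b in graph.get(a, set()) or a in graph.get(b, set())

def direct_pairs(graph, end1, end2):
    """Pairs (d1, d2) of related documents, one from each end stone."""
    return [(a, b) for a in sorted(end1) for b in sorted(end2)
            if a != b and related(graph, a, b)]

def stepping_stones(graph, end1, end2, topic_groups):
    """Topics whose document group is related to both end stones.

    Returns {topic: [(d1, mid, d2), ...]}; each triple is a pathway from
    an end-stone-1 document, through a document of the topic group, to
    an end-stone-2 document.
    """
    stones = {}
    for topic, group in topic_groups.items():
        paths = [(a, m, b)
                 for m in sorted(group)
                 for a in sorted(end1) if a != m and related(graph, a, m)
                 for b in sorted(end2) if b != m and related(graph, m, b)]
        if paths:
            stones[topic] = paths
    return stones

# Hypothetical data: documents retrieved for each subquery, and links.
end1 = {"p1", "p2", "both"}          # results for the first subquery
end2 = {"d1", "d2", "both"}          # results for the second subquery
graph = {"p1": {"d1"}, "p2": {"m1"}, "m1": {"d2"}}
topic_groups = {"message passing": {"m1"}, "file systems": {"f1"}}

shared = end1 & end2                  # satisfy both subqueries directly
pairs = direct_pairs(graph, end1, end2)
stones = stepping_stones(graph, end1, end2, topic_groups)
print(shared, pairs, stones)
```

The three results correspond to the three kinds of answers discussed in the text: single documents, pairs of documents, and triples that pass through a stepping stone.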

Our usability tests have shown that SSP finds connections
between query topics not found by users, as 
well as saves them time during exploration. Further 
testing is under way. Readers are invited to use the 
subqueries shown in Table 1 for the three collections 
so far prepared, working with our online demo [2], 
and to explore with SSP. 


Table 2. Examples of results of query splitting. 


QUERY SPLITTING FOR SSP 

Some users of SSP may not know enough, since they 
are just beginning to explore, to give the system two 
initial subqueries. Similarly, when a difficult query is 
given, for which all current search systems fail [1], it 
would be very helpful to split that query into two 
parts with which SSP can operate and yield good 
results. 

A number of difficult queries have been identified 
in connection with analysis of failures when searching 
against large collections, as called for in an annual 
competition for information retrieval techniques and 
systems—TREC (Text REtrieval Conference) [10]. 
We tested our query splitting methods using queries 
and judgments of which documents are relevant to 
those queries [10]. 

We picked queries identified as difficult to answer 
by a majority of retrieval systems [1, 10]. Extensive 
experimentation with those queries has allowed us to 
pick the best method among the several we consid- 
ered, and to tune its parameters to further improve its 
performance [12]. 

Table 2 illustrates our query splitting results for 
some of those difficult queries [12]. The first column 
is a short summary of the query; column 2 gives a 
more detailed description. Our algorithm returns pairs 
of subqueries, shown in the last two columns. These 
are found through a query expansion process that 
starts with the original query, along with documents 
retrieved for those queries, and then tries to separate 
the many terms associated, into groups about different 
topics. We have explored a variety of approaches as 
well as indicators to assess if the subqueries are differ- 
ent but related [12]; the sample of results shown indi- 
cates that plausible splits are possible. 
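
As an illustration of separating expansion terms into two topical groups, here is a minimal sketch based on how often terms co-occur in retrieved documents. It is not the method of [12]; the seeding rule, the tie-breaking, and the sample documents are assumptions made for the example.

```python
from itertools import combinations

def split_query(docs, terms):
    """Split candidate terms into two groups using document co-occurrence.

    docs  -- list of sets of terms, one set per retrieved document
    terms -- candidate expansion terms, including the original query terms
    """
    terms = sorted(terms)

    def cooc(t, u):
        """Number of documents containing both t and u."""
        return sum(1 for d in docs if t in d and u in d)

    # Seeds: the pair of terms that co-occur least often. With ties, the
    # first pair in alphabetical order wins, so the result is deterministic.
    seed_a, seed_b = min(combinations(terms, 2), key=lambda p: cooc(*p))
    group_a, group_b = [seed_a], [seed_b]
    for t in terms:
        if t in (seed_a, seed_b):
            continue
        # Attach each remaining term to the seed it co-occurs with more;
        # ties go to the first group.
        target = group_a if cooc(t, seed_a) >= cooc(t, seed_b) else group_b
        target.append(t)
    return group_a, group_b

# Hypothetical retrieved documents, reduced to sets of index terms.
docs = [
    {"quilt", "fabric", "sew"},
    {"quilt", "sew", "craft"},
    {"income", "tax", "family"},
    {"income", "tax"},
    {"quilt", "income"},
]
sub1, sub2 = split_query(docs, {"quilt", "sew", "income", "tax"})
print(sub1, sub2)
```

Each returned group can then serve as one of the two subqueries that SSP starts from.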


CONCLUSION 

Exploratory search takes many forms and can be 
aided by a variety of methods. Our work with digi- 
tal libraries and information retrieval has included 
tests of such methods in their use with collections in 
the computing field, as well as large standard collec- 
tions connected with TREC [10]. Our experiments 
have shown that visualization is a key ingredient in 
many of the recipes for exploration. Browsing 
through results clusters, when Carrot2 [7] is com- 
bined with our CITIDEL system, can help. More 
support is provided by integrated searching, brows- 
ing, and visualizing, as with CitiViz [8]. For more 
complex information needs, visualization connected 
with our Stepping Stones & Pathways [2, 3] 
approach, along with query splitting for complex 
and/or difficult queries [12], can leverage even more 
sophisticated algorithms to find sets of documents 
that in combination support more advanced types of 
exploration. 
CLUSTERING VERSUS FACETED CATEGORIES FOR 
INFORMATION EXPLORATION 


By MARTI A. HEARST 


Information seekers often express a 
desire for a user interface that organizes search 
results into meaningful groups, in order to 
help make sense of the results, and to help 
decide what to do next. A longitudinal study 
in which participants were provided with the 
ability to group search results found they 
changed their search habits in response to 
having the grouping mechanism available [2]. 

There are many open research questions 
about how to generate useful groupings and 
how to design interfaces to support exploration 
using grouping. Currently two methods are 
quite popular: clustering and faceted categorization.
Here, I describe both approaches and 
summarize their advantages and disadvantages 
based on the results of usability studies. 

Clustering refers to the grouping of items 
according to some measure of similarity. In 
document clustering, similarity is typically 
computed using associations and commonalities
among features, where features are typically 
words and phrases [1]. One of the better 
implementations of clustering of Web results 
can be found at Clusty.com.¹ 

The greatest advantage of clustering is that it 
is fully automatable and can be easily applied to 
any text collection. Clustering can also reveal 
interesting and potentially unexpected or new 
trends in a group of documents. A query on 
“New Orleans” run on Clusty.com on Sept. 
16, 2005 (shortly after the devastation wreaked 
by Hurricane Katrina), revealed a top-ranked 
cluster titled Hurricane, followed by the more 
standard groupings of Hotels, Louisiana, Uni- 
versity, and Mardi Gras. 

Clustering can be useful for clarifying and 
sharpening a vague query, by showing users the 
dominant themes of the returned results [2]. 
Clustering also works well for disambiguating 
ambiguous queries, particularly acronyms. For 
example, ACL can stand for Anterior Cruciate 
Ligament, Association for Computational Linguistics,
and Atlantic Coast Line Railroad, among 
others. Unfortunately, because clustering algorithms
are imperfect, they do not neatly 
group all occurrences of each acronym into 
one cluster, nor do they allow users to 
issue follow-up queries that only return 
documents from the intended sense (for 
example, “ACL meeting” will return meetings
for multiple senses of the term). 


¹Some of Clusty’s power comes from performing metasearch and showing 
only the top-ranked results. This function alone can produce improved 
results, since it combines the power and judgment of several different 
search engines’ rankings. 
An underappreciated aspect of clusters is their utility
for eliminating groups of documents from consideration.
This result is supported by participant 
comments found in several studies [2, 3]. For example, 
if most documents in a set are written in one language, 
clustering will very quickly reveal if a subset of the documents
is written in another language. 

The disadvantages of clustering include its lack of 
predictability, its conflation of many dimensions 
simultaneously, the difficulty of labeling the groups 
(Clusty.com’s top-level labels are among the best 
implementations), and the counterintuitiveness of 
cluster subhierarchies. Some algorithms [2, 8] build 
clusters around dominant phrases that make for 
understandable labels, but the contents of those clusters 
do not necessarily correspond to the labels. 

To illustrate these weaknesses, con- 
sider a recipe example, chosen because 
the relevant dimensions are familiar to 
most people and because exploration 
and browsing are natural tasks for 
recipe collections. A search for 
“chicken recipes” on Clusty.com (also 
on Sept. 16, 2005) turns up the fol- 
lowing motley assortment of groups: 


Salad 

Crockpot 

Chicken Breast 
Barbeque/Grilled 
Soup Recipes 
Healthy 

Lowfat 

Easy Chicken Recipes 
Italian 


This list is incomplete and inconsistent. Why 
Crockpot and Barbeque/Grilled, but not Baked and 
Fried? Why Chicken Breast but not Leg and Wing? Why 
Salad and Soup but not Main Course? Why Italian 
recipes but not Indian, Thai, or French? Furthermore, 
drilling down into the hierarchies rarely reveals intuitive
results. The 29 documents listed under Salad are 
organized by the labels: 


Complete selection of Trusted Chicken Recipes 
Cakes 

Better Homes and Gardens 

Collection 

Share 

Boneless Chicken Breast 

Pasta Salad 


and so on. Only Pasta Salad really belongs here as a 
label; it does not make sense for Boneless Chicken 
Breasts to appear in this cluster rather than in the 
Chicken Breasts cluster, and clearly Cake belongs in a 
Dessert category alongside Salad and Soups. 

These kinds of errors are quite typical for clustering 
output. Usability results show that users do not like 
disorderly groupings like these, preferring understand- 
able hierarchies in which categories are presented at 
uniform levels of granularity [4, 5]. 

Hierarchical Faceted Categories. A category system 
is a set of meaningful labels organized in such a way as 
to reflect the concepts relevant to a domain. They are 
usually created manually, although assignment of doc- 
uments to categories can be automated to a certain 
degree of accuracy. Good category systems have the 
characteristics of being coherent and (relatively) complete
and thus pose an advantage over the unpredictable
results of clustering; the studies that compare 
the two find that participants prefer categories [4, 5]. 

[...] metadata (partial results). 

A question arises as to what kind of category struc- 
ture is most effective for exploration and browsing of 
information collections. There is increasing recogni- 
tion that strictly hierarchical organization of categories 
is impoverished for these uses. 

An alternative representation, intermediate in com- 
plexity and very rich in flexibility, has become influen- 
tial over the last few years. This representation is 
known as hierarchical faceted categories (HFC) [7]. 
The main idea is quite simple. Rather than creating 
one large category hierarchy, build a set of category 
hierarchies each of which corresponds to a different 
facet (dimension or feature type) relevant to the col- 
lection to be navigated. In the case of chicken (and 
other) recipes, these category hierarchies can include 


Dish Type (Main, Soup, Salad, Side, Dessert), Ingredient
Type (Meat, Vegetables, Grains, Spices), Cooking 
Method (Bake, Fry, Grill, Easy), Cuisine Type (Italian, 
Indian, French), and so on. Each facet has a hierarchy 
of terms associated with it. 

After the facet hierarchies are designed, each item in 
the collection can be assigned many labels from the 
hierarchies. Thus a recipe for “Chicken Noodle Casse- 
role” might be assigned: 


Dish Type > Pasta 
Preparation Type > Baking 
Meat > Poultry > Chicken 
Vegetables > Celery 
Vegetables > Carrot 


and so on. Our research group has been investigating 
how to build an intuitive interface for exploration and 
discovery within information collections using HFC; 
we call the resulting interface framework Flamenco [7] 
(flamenco.berkeley.edu). 

This kind of interface allows flexible ways to access 
the contents of the underlying collection. For example, 
from the Meat facet, a user can choose to select the 
Poultry subcategory, and from this select in turn the 
Chicken subcategory. The user can choose any other 
facet, perhaps Dish and Courses, and from this select 
the Pasta category, and then group the resulting recipes 
by Vegetables, or Preparation Type, or any other facet 
(see the accompanying figure). Navigating within the 
hierarchy naturally builds up a complex query that is a 
conjunction of disjunctions over subhierarchies. 
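
The way navigation builds up a conjunction of disjunctions can be shown with a minimal sketch. It assumes each recipe carries a list of facet paths, as in the Chicken Noodle Casserole example above; it is not Flamenco's implementation, and the function names and the two extra recipes are hypothetical.

```python
from collections import Counter

# Each recipe carries any number of facet paths (hypothetical sample data).
RECIPES = {
    "Chicken Noodle Casserole": [
        ("Dish Type", "Pasta"),
        ("Preparation Type", "Baking"),
        ("Meat", "Poultry", "Chicken"),
        ("Vegetables", "Celery"),
        ("Vegetables", "Carrot"),
    ],
    "Grilled Chicken Salad": [
        ("Dish Type", "Salad"),
        ("Preparation Type", "Grilling"),
        ("Meat", "Poultry", "Chicken"),
        ("Vegetables", "Lettuce"),
    ],
    "Beef Lasagna": [
        ("Dish Type", "Pasta"),
        ("Preparation Type", "Baking"),
        ("Meat", "Beef"),
    ],
}

def under(path, prefix):
    """True if a facet path lies at or below the selected prefix."""
    return path[:len(prefix)] == prefix

def select(recipes, selections):
    """Keep recipes matching every selection (AND). Within one selection,
    any label in the chosen subhierarchy counts as a match (OR)."""
    return {name: paths for name, paths in recipes.items()
            if all(any(under(p, sel) for p in paths) for sel in selections)}

def preview(recipes, facet):
    """Count recipes per top-level value of a facet, for grouping results
    and for showing where the user could go next."""
    counts = Counter()
    for paths in recipes.values():
        counts.update({p[1] for p in paths if p[0] == facet and len(p) > 1})
    return counts

hits = select(RECIPES, [("Meat", "Poultry"), ("Dish Type", "Pasta")])
print(sorted(hits))
print(preview(hits, "Vegetables"))
```

Because `preview` only reports values that occur in the current hits, every choice it offers leads to a non-empty result set.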

An interface using HFC simultaneously shows pre- 
views of where to go next, and how to return to previ- 
ous states in the exploration, while seamlessly 
integrating free text search within the category struc- 
ture. The approach reduces mental work by promot- 
ing recognition over recall and suggesting logical but 
perhaps unexpected alternatives at every turn, while at 
the same time avoiding empty results sets. This orga- 
nizing structure for results and for subsequent queries 
can act as scaffolding for exploration and discovery. 

We have conducted a series of usability studies that 
find that, for browsing tasks especially, HFC-enabled 
interfaces are overwhelmingly preferred over the stan- 
dard keyword-and-results listing interfaces used in 
Web search engines [7]. Study participants find the 
design easy to understand, flexible, and less likely to 
result in dead ends. 

One drawback of HFC interfaces (as opposed to 
clusters) is that the categories of interest must be 
known in advance, and so important trends in the data 
may not be shown. But by far the greatest drawback is 
the fact that in most cases the category hierarchies are 
built by hand and automated assignment of categories 
to items is only partly successful. 

Our group has recently made some progress in the 
problem of nearly automatic creation of hierarchical 
faceted categories [6]. A portion of the output of the 
system applied to the text of a recipe collection is shown 
in the figure. The algorithm, which makes use of the 
WordNet hierarchy, draws out detailed categories for 
ingredients, dishes, and (unexpectedly) cooking equip- 
ment and people, but misses facets such as cuisine. We 
call the algorithm nearly automated, since the results 
require some editing by hand. There is much room for 
improvement, and we see automatic creation of faceted 
hierarchies as an important area for research. 


IMPACT AND THE FUTURE 

To date, both HFC and clustering are boutique 
search interfaces; they are applied and used primarily 
in domain-specific collections. There are many 
movements afoot to promote larger scale use of meta- 
data more generally. Hierarchical faceted metadata is 
already common in many e-commerce interfaces; for 
example, eBay and Shopping.com are experimenting 
with different variations of the idea, and Endeca.com 
provides a custom solution. It is probably possible to 
automatically impose a faceted structure onto grass- 
roots created tag collections such as those seen at 
Flickr. However, it is an open question whether these 
will eventually be widely and regularly used on the 
open-domain Web. 
SUPPORTING INSIGHT-BASED INFORMATION 
EXPLORATION IN INTELLIGENCE ANALYSIS 


By JOHN GERSH, BESSIE LEWIS, 
JAIME MONTEMAYOR, CHRISTINE PIATKO, 
and RUSSELL TURNER 


Capturing the exploratory search process can 
help represent analytical insight. 


We are interested in the role of exploratory 
search in the intelligence analysis process, especially
its role in sensemaking: how can exploring
a set of information help an analyst to 
synthesize, understand, and present a 
coherent explanation of what it tells us 
about the world? 
The process of exploratory search can help 
an analyst develop a story implied by relationships 
among discovered information items. One of the key 
challenges in supporting this process is the representa- 
tion, depiction, and recording of insights—the basic elements of analysis. 
We have developed a simple concept for such representations, called “rich 
information collections,” which contain not just the analyst’s search results but 
also an executable collection specification by which similar information can be 
found, together with the analyst’s rationale for collecting
them. In this article, we describe the nature of 
rich information collections, illustrate how they can 
be used in the process of exploring information rep- 
resented by complex graphs such as social networks, 
and present a general concept for using such collec- 
tions in analysis. 
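
One way to picture a rich information collection is as a small data structure holding the three parts named above: the results, an executable specification, and the rationale. The following is a minimal sketch under that reading only; the class, its fields, and the sample records are hypothetical and do not describe the authors' system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Item = Dict[str, str]

@dataclass
class RichCollection:
    """Search results plus the specification and rationale behind them."""
    rationale: str                    # why the analyst gathered these items
    spec: Callable[[Item], bool]      # executable collection specification
    items: List[Item] = field(default_factory=list)

    def run(self, source):
        """Re-execute the specification against a (possibly updated) source."""
        self.items = [item for item in source if self.spec(item)]
        return self.items

# Hypothetical records; the specification is an ordinary predicate.
people = [
    {"name": "A", "received_from": "X"},
    {"name": "B", "received_from": "Y"},
]
coll = RichCollection(
    rationale="People who received shipments from X",
    spec=lambda item: item.get("received_from") == "X",
)
coll.run(people)

# When new information arrives, the same specification finds similar items.
people.append({"name": "C", "received_from": "X"})
coll.run(people)
print([p["name"] for p in coll.items])
```

Keeping the specification alongside the results is what allows the collection to be re-run later, and keeping the rationale alongside both is what lets a colleague see why the items were gathered.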


A primary product of intelligence 
analysis is insight; insight is about something, and it is 
based on something (“I think that these three people 
might be planning an attack because this data shows 
they received explosives from a known terrorist.”) 
Descriptions and models of the analysis process often 
include chains of data, evidence, hypotheses, and 
other constructs, in which a collection of lower-level 
information supports a statement (hypothetical or 
real) at some higher level of organization or abstrac- 
tion [1]. Furthermore, the analysis process is inher- 
ently iterative, involving alternating narrowing and 
broadening of focus in a search for information [9]. 
In this context, exploratory search can be character- 
ized by successive queries interspersed with stages of 
sensemaking [10]. At each stage, the analyst’s evolving 
insight determines the nature and utility of additional 
information and the explanatory role of information 
discovered. 

In typical cases, an analyst pulls together a collec- 
tion of interesting nuggets of information from 
reports, databases, open sources, colleagues, and so 
forth. The resulting insight may remain in the ana- 
lyst’s head, it may be recorded in text snippets on 
paper or in computer files, or it may be organized into 
a formal report. Insights are tested against new infor- 
mation over time and are modified or discarded. Cur- 
rent insights may be shared with colleagues and 
insights from the past may be resurrected when similar
situations arise. The effectiveness of such sharing
and recall, though, may be limited by the ad hoc
nature of the records and their
separation from the information
that generated them. Marchionini
(in this section) characterizes
search activities as supporting
lookup, learning, or investigating.
Exploratory search for developing
coherent insight about what is
happening in the world involves
frequent transitions among all three activities; tools
for supporting sensemaking should provide a framework
for maintaining context across those transitions.
An explicit representation within an information system
of the insight and its supporting data can help to
provide that context.

Figure 1. Two semantic neighborhoods [...] necessarily directly
connected. The enlarged human icons indicate people who
belong to multiple neighborhoods.

Testing and modification of ideas can also benefit 
from such a representation. Requiring the analyst to 
record insights in a special software application, 
though, could interrupt the thought process and add 
to workload. What if the information search and 
exploration process itself could be used to record the 
development of the insight? 


A SCENARIO 

Figure 2a. Semantic navigation path control showing the levels of
progression of an analyst’s exploratory activities in a social network graph.

Consider the following scenario: Mary, an intelligence
analyst, is interested in the activities of a suspected
terrorist cell. She has populated a local [...]
with meeting attendees and
people known to have participated
in these events. Her
resulting insight is that these
newly discovered people form
a group that might carry out a
plan made at the meeting. She
creates in her workspace an 
information collection that 
includes these people together 
with a specification of the path of entity and relation- 
ship types she followed to find them while exploring 
the social-network graph (“person that participated-in
an event that is associated-with a
person who attended this-meeting”). Mary
annotates the collection with a description of her 
insight about potential activity. A colleague, Tom, also 
alerted to the meeting, asks Mary if she thinks cell 
members are involved. She sends him her information 
collection and he adds it to his own workspace. When 
Tom looks at it, he is presented with the collection 
depicted as Mary saw it, annotated with her conclu- 
sion. The next day Tom requests an update of the 
information in the collection; its specification of the 
social network path is re-executed to see if any new 
entities in the graph should be added to the collection 
or previous members removed. Note that the key elements
in this scenario include: the collection of information
put together by the analyst during exploratory
search, an executable specification for finding similar
information, and the annotation describing her
insight. Our work has led us to consider these elements
as the components of a richly described information
collection that can be used to represent what
an analyst's insight is about, what it is based upon, and
what it is. Hence, a rich information collection.

Figure 2b. Semantic neighborhood description and visualization control.
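The three elements named in the scenario can be pictured with a small sketch. It is only an illustration under stated assumptions: the edge list, the entity names, and the `follow_path` and `RichCollection` names are invented here and are not taken from the prototype system. The sketch stores the items, an executable chain of relationship types, and an annotation, and it re-executes the chain to report what has changed, much as Tom's update request does.

```python
from dataclasses import dataclass

# A tiny typed graph: each edge is (source, relation, target).
# All entities and relations below are invented for illustration.
EDGES = [
    ("alice", "attended", "meeting"),
    ("bob", "attended", "meeting"),
    ("event1", "associated-with", "alice"),
    ("event2", "associated-with", "bob"),
    ("carol", "participated-in", "event1"),
    ("dave", "participated-in", "event2"),
]


def follow_path(edges, start, relations):
    """Walk outward from `start` along a chain of relation types.

    `relations` is listed from the start node outward, so
    ["attended", "associated-with", "participated-in"] finds each
    person that participated-in an event that is associated-with
    a person who attended `start`.
    """
    frontier = {start}
    for relation in relations:
        frontier = {src for (src, rel, dst) in edges
                    if rel == relation and dst in frontier}
    return frontier


@dataclass
class RichCollection:
    items: set            # what the insight is based upon
    start: str            # anchor entity of the collecting specification
    relations: list       # executable chain of relationship types
    annotation: str = ""  # the analyst's statement of the insight

    def refresh(self, edges):
        """Re-execute the specification; return (added, removed)."""
        current = follow_path(edges, self.start, self.relations)
        added, removed = current - self.items, self.items - current
        self.items = current
        return added, removed


spec = ["attended", "associated-with", "participated-in"]
collection = RichCollection(
    items=follow_path(EDGES, "meeting", spec),
    start="meeting",
    relations=spec,
    annotation="Possible action team for a plan made at the meeting.",
)

# A day later the graph has one new edge; the update reports the change.
added, removed = collection.refresh(
    EDGES + [("erin", "participated-in", "event1")])
```

Because the specification travels with the items, a colleague who receives the collection can rerun it against a newer graph without reconstructing how the original items were found.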


SEMANTIC NEIGHBORHOODS: AN EXAMPLE OF RICH 
INFORMATION COLLECTIONS 

The concept of a rich information collection has 
evolved from our work on user interactions with 
visual representations of complex conceptual graphs 
(for example, social or contact-chaining networks). 
In particular, we developed “semantic navigation” to 
support user-guided exploration of such graphs [8]. 
Semantic navigation is a constrained approach for 
exploring a graph that results in a collection of enti- 
ties and relationships meaningfully related to one 
another that form what we call a “semantic neighborhood”
(as opposed to topological “nearest-neighbors”).
Our hypothesis is that meaningful collections
of entities are likely to be distributed
throughout the graph, where “meaningful” is rela- 
tive to a particular analyst’s exploration of the infor- 
mation represented. (In the scenario, Mary discovers 
a particular semantic neighborhood.) This hypothe- 
sis has been supported by discussions with and 
demonstrations to former intelligence analysts. 

A semantic neighborhood contains entities whose 
relationship is explained by an analyst’s insight. Figure 
1 shows two semantic neighborhoods in a social-net- 
work graph as depicted in our prototype semantic 
navigation system. (Information in the graph was 
extracted manually from news reports.) One neigh- 
borhood relates to possible implementers of an attack 
that may have been planned at a meeting, as described 
in the scenario here. The other neighborhood relates 
to known members of a terrorist group. Figure 2a 
shows the control panel for specifying the chain of 
entities and relationships that defines one of these 
neighborhoods. Figure 2b shows the control panel for 
selecting neighborhoods to display, selecting attrib- 
utes and modes of display, and entering information 
describing a neighborhood. Visual attributes include 
color, transparency, icon, and size of nodes and links. 
Controls include neighborhood selection, background/foreground
contrast, and selection of representation mode (an
exploration discovery summary or a
full elaboration). Displaying multiple
neighborhoods automatically highlights
common members. For example,
the two enlarged icons in Figure
1 represent people that belong to
both of the depicted neighborhoods.

Figure 3. A generalized picture of a rich
information collection. The data can be
collected using an arbitrary number of
constraints through different methods.
The data, how it was collected, and a
descriptive annotation combine to record
the analyst’s insight.
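Highlighting members shared by several displayed neighborhoods amounts to counting how many neighborhoods each entity belongs to. The following is a minimal sketch of that idea; the function name and the sample data are assumptions made for illustration, not part of the prototype.

```python
from collections import Counter


def common_members(neighborhoods):
    """Return the entities that appear in more than one neighborhood."""
    counts = Counter(entity
                     for members in neighborhoods.values()
                     for entity in set(members))
    return {entity for entity, n in counts.items() if n > 1}


# Hypothetical neighborhoods keyed by their annotations.
neighborhoods = {
    "possible action team": {"carol", "dave", "erin"},
    "known group members": {"dave", "erin", "frank"},
}
highlighted = common_members(neighborhoods)
```

A display layer could then draw the entities in `highlighted` with enlarged icons.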

The semantic neighborhood is 
one example of a rich information 
collection. Its components are a set of information 
items, a collecting specification, and a descriptive 
annotation, as represented by Figures 1, 2a, and 2b,
respectively. The information item set is made up of
individual items appropriate to an analyst’s domain of 
investigation. It might contain images, reports about 
people, organizations, and financial institutions for an 
analyst following a money trail, or even individual 
hypotheses in an analysis of competing hypotheses 
[7]. In fact, the products of one level of analysis might 
be information set members for another level. The 
collecting specification describes how the information 
items were gathered in a format that can be repeated 
to find other such items. It might be a simple database 
query, a chain of relationship types in a social-net- 
work graph, or a place and time for surveillance. The 
annotation is a statement of the analyst’s particular 
insight. In Figure 2b, the annotation is a textual 
description; it could also be, for example, a markup of 
a diagram or picture, or a Web link. The rich infor- 
mation collection thus elaborates upon the raw infor- 
mation it contains with a method for obtaining 
similar information and an explanatory annotation; 
these provide context for understanding and use in 
another situation. 

An important characteristic of rich information 
collections is that they do not constrain the analyst to 


The semantic neighborhood is one example of 
a rich information collection. Its components 
are a set of information items, a collecting 
specification, and a descriptive annotation. 


any particular method for building or modifying sets 
of information items. We are extending our prototype 
system to include interaction capabilities that support 
collection building in ways other than semantic navi- 
gation. These include keyword searches, dynamic 
queries [11], manual inclusion or removal of items, 
and subgraph templates for pattern matching. In gen- 
eral, users will broaden or narrow the search as they 
construct and revise the query pattern, by selecting 
individual collection members, or by limiting the 
timeframe of event nodes using dynamic range slider 
controls. Our research is at an early stage; in addition 
to including a wider repertoire of query mechanisms, 
we are investigating ways for visually organizing sets of 
rich information collections and for explicitly sup- 
porting the changes in insight that occur during the 
analysis process. 


MAKING SENSE WITH RICH INFORMATION COLLECTIONS 
The concept of a rich information collection (see 
Figure 3) can be used to support more general sense- 
making activities. (See [3, 12] for discussions of 
sensemaking.) Bodnar [1, 2], for example, describes 
several stages in analysis, each stage involving infor- 
mation at increasing levels of abstraction: Data 
Source, Shoebox (the loose collection of “what I’m 
working on”), Evidence, Schema, and Theory. A 
similar description appears in [10]. Key points in 
both are that there is a sequence of stages that can be 
traversed either bottom-up (data-driven) or top- 
down (hypothesis-driven) in which the products of 
one stage are the input to the next; and that the 
process is iterative—one can actually go from almost 
any stage to any other. 

A rich information collection can be used to main- 
tain context while traversing these stages. A collection 
at one level may represent an entity at the next. For 
example, Mary's analysis scenario describes the con- 
struction of a collection that, in fact, represents a 
potential terrorist action. The collection itself can 
form an information element in the following step. 
The existence of the cell, for example, could be an 
item of evidence supporting a schema describing 
potential activity. The rich information collection put 
together by the analyst represents the analyst's insight 
that this item of evidence is valid. An individual ana- 
lyst could keep a set of rich information collections in 
a local workspace, updating them, relating them, and 
modifying them as the process of analysis continues. 
A rich information collection may also represent a 
hypothesis, especially if it involves hypothetical enti- 
ties or relationships. A user-created information ele- 
ment (“I think these two people are linked.”) can be 
marked as hypothetical. Executing a rich information
collection’s collecting specification could serve as a
search for the existence of the hypothetical entity. 
Diagnostic evidence between competing hypotheses 
could also be represented in this way. 

The presence of the individual information items 
and the collecting specification in the rich informa- 
tion collection means it is “live” for users with access 
to the information items. Changes in the collection, 
due to the addition, modification, or deletion of 
items, can be immediately linked to representations of 
the collection itself. This can enable an active chain in
which the representation of analytic assessment in a 
user interface highlights in response to changes in a 
data item relating to a piece of evidence on which the 
assessment is based. 


Other concepts of dynamic information
sets that include exploratory queries or query
sequences are e-sets in the NaviQue information gath- 
ering and organizing environment [5] and dynamic 
aggregates in the Visage visual query environment [4]. 
E-sets are sets of items (data or executable queries, for 
example) together with metadata and a user interface 
widget for viewing the sets. Dynamic aggregates are 
information sets whose contents are specified in part 
by an associated query sequence. Our approach to 
rich information collections expands on the basic con- 
cept by providing a set of operations for graph navi- 
gation and a generalizable annotation mechanism. 
The approach also provides an information architecture
based on the central role that rich information collections
have in the sensemaking process.


SOFTWARE ARCHITECTURE FOR RICH INFORMATION 
COLLECTIONS 

We have developed a software architecture for build- 
ing interactive visual applications that create and use 
rich information collections. The architecture 
defines abstract interfaces to both a data model and 
a user interface framework, allowing data model and 
user interface implementations to vary. The abstract 
data model represents a network consisting of nodes
that represent entities such as persons, places, or
organizations, and connecting edges that represent 
relationships between entities. The nodes can be 
arbitrarily connected via edges to form a directed 
graph data structure. Information in the form of 
text, numerical values, or other data types can be 
associated with individual entities and relationships 
by attaching arbitrary attributes to the nodes and
edges in the graph. To support collections and
semantic navigation, the data model defines sets of 
nodes and edges. 

Sets can also have arbitrary attributes, providing 
for the association of query sequences and annotations
with collections. Sets may contain other sets,
providing a mechanism for supporting relationships
among sensemaking stages. The software architecture 
contains a plug-in mechanism for assembling compo- 
nents of visualization functionality into a user-inter- 
face framework. These visualization components 
communicate with each other via the semantic net- 
work data model using an event passing mechanism 
that follows a model-view-controller design pattern. 
We are using this framework to develop additional 
user interface mechanisms for interacting with infor- 
mation elements and rich information collections to 
support analysts’ exploratory search and sensemaking 
activities. 
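The data model described in this section can be sketched in a few classes. This is an illustrative reading only: the class names, attribute keys, and callback signature are assumptions, and the actual architecture defines abstract interfaces whose details are not given here. The sketch shows attributes on nodes, edges, and sets, a set nested inside another set, and a simple event-passing hook through which views could be notified of changes.

```python
class Element:
    """Anything in the model that can carry arbitrary attributes."""

    def __init__(self, **attributes):
        self.attributes = dict(attributes)


class Node(Element):
    """An entity such as a person, place, or organization."""


class Edge(Element):
    """A directed relationship between two nodes."""

    def __init__(self, source, target, **attributes):
        super().__init__(**attributes)
        self.source = source
        self.target = target


class ElementSet(Element):
    """A set of nodes, edges, or other sets, with its own attributes."""

    def __init__(self, **attributes):
        super().__init__(**attributes)
        self.members = []
        self.listeners = []  # callbacks registered by views

    def add(self, member):
        self.members.append(member)
        for notify in self.listeners:  # event passing to every view
            notify("added", self, member)


# Usage with hypothetical data: a collection nested in a higher-level set.
mary = Node(kind="person", name="Mary")
tom = Node(kind="person", name="Tom")
link = Edge(mary, tom, relation="sent-collection-to")

events = []
collection = ElementSet(annotation="Shared with a colleague",
                        query=["sent-collection-to"])
collection.listeners.append(
    lambda kind, owner, member: events.append((kind, member)))
for element in (mary, tom, link):
    collection.add(element)

evidence = ElementSet(stage="Evidence")
evidence.add(collection)  # a set may contain another set
```

Keeping the query sequence and the annotation as ordinary set attributes means no separate storage is needed for the parts of a rich information collection.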


CONCLUSION 
To paraphrase Hamming [6], the purpose of 
exploratory search is insight, not data. In intelli- 
gence analysis, as in other domains, that insight 
comes from the process of exploration, not just from 
its end result. We are interested in capturing and 
visually representing analysts’ iterative query 
processes and insights to help them collect and com- 
pare information more effectively, as well as record 
and share the products of their analytic insights. 
Our early work has focused on insight-based infor- 
mation exploration of social networks. We are continu- 
ing to develop and evaluate further exploratory search 
interaction and visualization mechanisms to create, com- 
pare, and share rich information collections. We believe 
that such mechanisms will help analysts’ exploratory 
search activities to be more effective for sensemaking and
for sharing their insights with others.


FIND THAT PHOTO!


Interface Strategies to Annotate, 
Browse, and Share 


By BEN SHNEIDERMAN, BENJAMIN B. BEDERSON, 
and STEVEN M. DRUCKER 


As digital photos become the
standard media for personal photo taking,
helping users explore those photos
becomes a vital goal. Dominant strategies
that have emerged involve innovative user 
interfaces that support annotation, browsing, 
and sharing that add up to rich support for 
exploratory search. Successful retrieval is 
based largely on attaching appropriate anno- 
tations to each image and collection since 
automated image content analysis is still lim- 
ited. Therefore, innovative techniques, novel 
hardware, and social strategies have been pro- 
posed. Interactive visualization to select and 
view dozens or hundreds of photos extracted 
from tens of thousands has become a popular 
strategy. And since the goal of photo search is 
to support sharing, storytelling, and remi- 
niscing, experiments with new collaborative 
strategies are being examined. 

While digital photographic databases and 
retrieval systems have been in use for many 
years, these systems were typically designed for 


professionals in museums, libraries, advertis- 
ing, and journalism, to name a few specialities. 
Such systems employed a cadre of financially 
motivated individuals to hand-annotate the 
pictures with metadata such as keywords, 
dates, and locations, often using fixed vocabu- 
laries, to support traditional search techniques. 
By contrast, consumers typically put little 
effort into photo annotation; they are more 
focused on exploratory search and serendipi- 
tous discovery of photos with a stronger 
emphasis on entertainment. This leads to a 
very different set of requirements for personal 
photo use where ease of annotation, support
for exploratory browsing, and convenient
sharing are crucial.
Annotate. In textual exploratory search,
users can enter key phrases from a document
to retrieve similar content. But for
images, retrieval based on content
through automated analysis is often
limited to some forms of shape analysis
(such as finding the presence of faces in
an image) and color matching to find sunrises
or determine whether an image was taken
inside or outside.

To support effective exploratory search on 
photos, appropriate annotations must be associated
with the images either by the camera or by
users of the images, such as the photographer or 
potentially a larger community of users. Cameras are 
increasingly recording information about the photo- 
graph including time and date stamps, tilt sensors for 
orientation, light levels, focal distances, and even 
global position. Barcodes, RFID tags, or other label- 
ing methods could enable a higher percentage of pho- 
tos to be annotated automatically. 


Many interfaces enable manual
annotation of photographs by “painting” keywords
[3] or dragging and dropping names onto images.
Commercial tools such as Adobe Photoshop Album
make tags draggable onto photo borders. Other tools
perform temporal clustering to create a more manageable
set of photo groups [1]. As with many tasks,
manual annotation can be improved by designing
interfaces that support faster and easier annotation as
well as making the future
benefits more apparent.
Automatic and manual
annotations are valuable in
supporting both searching
and browsing.

Browse. Users browse 
for fun and to find a spe- 
cific photograph. They 
may be looking for photos 
of their grandfather, their 
hike down the Grand 
Canyon, or a friend’s wed- 
ding. They also may be 
looking for a great photo to 
accompany a story of a sun- 
rise hike or memorable 
baseball game. 

Clearly, if the photo collection
has been extensively
annotated, techniques such
as faceted search (see
Hearst’s article in this section)
can help users filter
down a collection and show
potential targets for browsing.
User-controlled visualization of photos grouped
by date, location, or annotation can greatly facilitate
browsing and increase enjoyment [4]. Different layouts
of photos can exploit this metadata to help people
find desired photos and discover new ones. In
particular, geo-tagging of photos and interfaces, like
WWMX, allows people to find all those photographs
associated with a particular area (see Figure 1).

Figure 1. The WorldWide Media Exchange (WWMX) interface
showing map and calendar views along with images as
published in ACM Multimedia 2003; wwmx.org.

Chronological displays work well for dates as well, 
but large numbers of photos can be overwhelming, so 
groups of photos can be clustered by date and represen- 
tative photos can be manually or automatically chosen 
for each cluster [1, 2]. These representative photos 
again help to provide landmarks in order for users to 
locate photos from particular events. Interfaces such as 
PhotoMesa use powerful filtering tools, plus flexible 
grouping and rapid zooming, to enable users to explore 
thousands of photos fluidly (see Figure 2). 
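Clustering photos by date and choosing a representative for each cluster can be illustrated with a short sketch. It is not the method of the cited tools: the fixed two-hour gap and the choice of the middle photo as representative are assumptions made here to keep the example small.

```python
from datetime import datetime, timedelta


def cluster_by_time(timestamps, max_gap=timedelta(hours=2)):
    """Group timestamps; a gap longer than max_gap starts a new cluster."""
    clusters = []
    for stamp in sorted(timestamps):
        if clusters and stamp - clusters[-1][-1] <= max_gap:
            clusters[-1].append(stamp)
        else:
            clusters.append([stamp])
    return clusters


def representative(cluster):
    """Pick the middle photo as a simple stand-in for a representative."""
    return cluster[len(cluster) // 2]


# Hypothetical capture times for five photos taken at two events.
photos = [
    datetime(2005, 7, 4, 10, 0),
    datetime(2005, 7, 4, 10, 30),
    datetime(2005, 7, 4, 11, 15),
    datetime(2005, 7, 9, 18, 0),
    datetime(2005, 7, 9, 18, 5),
]
clusters = cluster_by_time(photos)
landmarks = [representative(cluster) for cluster in clusters]
```

The timestamps recorded by the camera are the only input, so this kind of grouping needs no manual annotation.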


Figure 2. PhotoMesa showing 114
photos in six groups in a single
view with integrated annotation
and search tools as published in
ACM UIST 2001 (courtesy of
www.photomesa.com).


Share. Sharing photos
by email, instant messaging,
Web sites, and cell
phones is a growing success
story. When users
select photos and make them available to others, they
seem to be willing to invest more effort in annotation.
Also, by making them public, they invite others
to comment and add annotations. More elaborate
story-generating tools invite users to provide 
slideshow sequences with text captions and audio 
narration. 

Recent innovations in social experiences on the 
Web have sought to encourage annotation by increas- 
ing satisfaction and making the benefits immediately 
apparent. A game-like approach to image annotation 
gets players to cooperate with anonymous, remotely 
located partners in assigning keywords for pho- 
tographs [5]. This surprisingly addictive game has 
succeeded in labeling over 10 million images as of 
August 2005 (since its introduction in 2003). Other 
communities, such as Flickr, allow users to share and 
annotate images on a Web site using tags. These 
“folksonomies” have now gone past photos to Web
pages and blogs as well (such as Technorati and
del.icio.us).

The trend toward annotating, browsing, and sharing 
your photos via Web sites such as Flickr, Ofoto, and 
Shutterfly is perhaps one of the biggest changes enabled 
by the transformation from analog to digital photogra- 
phy. Photos no longer sit unattended in shoeboxes 
stored in attics, but are available for ready viewing by 
friends and family distributed around the world.

SUMMARY
A combination of annota- 
tion, browsing, and sharing 
of photos can support the 
special exploratory search 
needs of personal digital 
photo users by getting 
around the fact that direct 
search of image content con- 
tinues to be beyond the capa- 
bilities of current systems. 
The special needs of 
amateur digital photogra- 
phers are pushing the photo 
industry to support users 
with their desired activities. 
Social networking, in com- 
bination with innovative 
user interfaces and visualiza- 
tion, is just beginning to 
support everyday photogra- 
phers. However, we see significant work remaining, 
especially in metadata standardization to help users 
cope with their rapidly growing and increasingly valued
collections.


USING TEMPORAL PATTERNS
OF INTERACTIONS to 
Design Effective Automated 
Searching Assistance 


By BERNARD J. JANSEN 


Web search engines effectively
support searching of short queries,
with session durations of typically 15 min- 
utes, while providing limited searching assis- 
tance to the user. However, Web search 
engines are less effective in the more com- 
plex searching situations in which users lack 
the domain knowledge or contextual aware- 
ness to use the system effectively. User uncer- 
tainty about the information needs, the 
content space, or the system’s capabilities 
typify these types of searches. The interac- 
tions among searcher, system, and content 
are more multifaceted. Current research 
indicates that approximately 15% of user 
sessions on the Web are more complex 
searching episodes [3]. 

Assistance from the search engine can 
improve the searching experience for the user. 
There has been considerable work on develop- 
ing contextual help and recommender systems 
to aid searchers. However, prior work has 
noted that searchers seldom utilize this sup- 
port, resulting in ineffective or inefficient 
searches. Issues hindering the use of these 
automated assistance systems may result from
a lack of understanding, on the part of the system,
about the information task, and when
users desire assistance. The negative aspects of 
task interruption for searchers may negate 
whatever benefits the assistance provides. 

Prior research indicates there are typical 
patterns of interactions between searchers and 
systems [4]. It would seem reasonable to lever- 
age these patterns to provide searching assis- 
tance to aid the user, with the goal of 
improving the outcome of the search process. 
However, beyond simple assistance (for exam- 
ple, spelling corrections for query terms), the 
results have been mixed [1]. Certainly, the 
cognitive load of information seeking in these 
contextual situations is high. The interjection 
of assistance into the search process may be 
too much of a cognitive load, requiring a task 
switch from focusing on the search process to 
mentally processing the assistance. Therefore, 
searchers ignore or improperly implement the 
assistance. 

Some researchers are investigating an 
approach that is more synergistic with the 
search process. Assistance is offered when the 
searcher is most receptive to it. This is in con- 
trast to system intervention at every step in the 
search process or when the system deems there 
is a need for assistance. Assistance synchronized
with the user’s searching process helps
reduce the cognitive effort required for task
switching and can increase the implementation
of system assistance.

We have conducted a series
of research experiments examining
temporal patterns of interaction 
between searchers and system assis- 
tance. Specifically, we gauged the tem- 
poral interactions of when searchers 
seek and implement searching assis- 
tance. Results from these investigations 
on searching assistance indicate that patterns 
of user-system interaction styles are short, typ- 
ically only two or three interactions. Previous 
research also supports this finding [2, 4]. 

In our experiments, we have identified a tax- 
onomy of 26 interactions, which we categorized 
into nine groups (see the accompanying table). 
The majority of interactions are with result list- 
ings (approximately 20%) and selection of doc- 
uments (approximately 15%). Interactions with 
system assistance are
approximately 4%. Searchers typically seek out [...]

Table: Taxonomy of user-system interactions.

We evaluated the effectiveness of automated assistance in a
user study with complex searching tasks,
namely the National Institute of Standards and
Technology's Text Retrieval Conference
searching topics and content collections. We
compared a system with pattern-based searching
assistance to a system with assistance provided
at every point in the search process.

[...] query syntax, and error
detection) at critical points
in the search process when
searchers have the highest
probability of being receptive
to this intervention.
The application integrates
with the searcher’s browser (see the figure).

The application relies solely on searcher 
interactions for determining intervention tim- 
ing and type of assistance. The application rec- 
ognizes the interaction patterns of the current 
searcher, comparing them to preprogrammed 
patterns. Assistance is provided when a match 
between these patterns occurs. The application 
is client-side, so the searcher is not limited to 
one search engine. 
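The matching step described above, comparing the current searcher's recent interactions to preprogrammed patterns, might look like the following sketch. The interaction codes, the two patterns, and the assistance messages are invented for illustration and are not the taxonomy or rules used in the study.

```python
# Hypothetical interaction codes and patterns, invented for this sketch.
PATTERNS = {
    ("submit_query", "view_results", "submit_query"):
        "suggest query refinement",
    ("view_results", "next_page", "next_page"):
        "suggest narrowing the query",
}


class AssistanceMonitor:
    """Client-side watcher that offers help only when a pattern completes."""

    def __init__(self, patterns):
        self.patterns = patterns
        self.history = []

    def record(self, interaction):
        """Log one interaction; return assistance if a pattern just ended."""
        self.history.append(interaction)
        for pattern, assistance in self.patterns.items():
            if tuple(self.history[-len(pattern):]) == pattern:
                return assistance
        return None


monitor = AssistanceMonitor(PATTERNS)
offers = [monitor.record(step) for step in
          ["submit_query", "view_results", "submit_query", "view_results"]]
```

Because the monitor looks only at the tail of the interaction history, short patterns of two or three interactions are enough to trigger an offer, and every other step passes without interruption.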


Results based on the number of relevant documents
identified indicated that searching
assistance synchronized with the manner in 
which users search can improve the perfor- 
mance of the searching process. In 70% of the 
cases, searchers on the system with the pat- 
tern-based searching assistance performed 
better than searchers on the other system. 
Results also indicate that users most com- 
monly implemented assistance to manage 
results (38% of implementations), over- 
whelmingly acting to restrict the query. This 
signifies that precision is a key issue for
exploratory searching.

The implications are that one can develop systems
to support exploratory searching, and one can design
these systems so that searchers use them. The next
step is personalizing these systems at the individual
level, based on user preferences and searching tactics.
Effectively designing Web searching systems to assist
the searcher can improve exploratory searching in
a variety of domains including e-commerce, data
analysis, and competitive intelligence.


REFERENCES

1. Anick, P. Using terminological feedback for Web search refinement: A
log-based study. In Proceedings of the 26th Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval.
(Toronto, Canada. July 28–Aug. 1, 2003), 88-95.

2. Chen, H.-M. and Cooper, M.D. Stochastic modeling of usage patterns
in a Web-based information system. J. American Society for IS and Tech.
53 (2002), 536-548.

3. Jansen, B.J. and Spink, A. How are we searching the World Wide Web?
A comparison of nine search engine transaction logs. Information
Processing & Management 42 (2005), 248-263.

4. Qiu, L. Markov models of search state patterns in a hypertext information
retrieval system. J. American Society for IS 44 (1993), 413-427.


BERNARD J. JANSEN (jjansen@acm.org) is an assistant professor in
the School of Information Sciences and Technology, The Pennsylvania
State University, University Park, PA.


Permission to make digital or hard copies of all or part of this work for personal or classroom
use is granted without fee provided that copies are not made or distributed for
profit or commercial advantage and that copies bear this notice and the full citation on
the first page. To copy otherwise, to republish, to post on servers or to redistribute to
lists, requires prior specific permission and/or a fee.

© 2006 ACM 0001-0782/06/0400 $5.00


ACM is Pleased to Announce a NEW Publication!

ACM Journal on Emerging Technologies in Computing Systems

JETC covers research and development in emerging technologies in
computing systems. Major economic and technical challenges are expected
to impede the continued scaling of semiconductor devices. This has
resulted in the search for alternate mechanical, biological/biochemical,
nanoscale electronic, and quantum computing and sensor technologies.
As the underlying nanotechnologies continue to evolve, it has become
imperative for computer scientists and engineers to translate the potential
of the basic building blocks (analogous to the transistor) into
information systems.

PRODUCT INFORMATION
ISSN: 1550-4832
Order Code: 154
Price: $40 Professional Member
$35 Student Member
$140 Non-Member
$14 Air Service (for residents outside North America only)

ORDER TODAY! Please contact ACM Member Services:
Phone: 1.800.342.6626 (U.S. and Canada)
+1.212.626.0500 (Global)
Fax: +1.212.944.1318
(Hours: 8:30am to 4:30pm, Eastern Time)
Email: acmhelp@hq.acm.org
Mail: [...] New York, NY 10286-1414 USA

Association for Computing Machinery
Advancing Computing as a Science & Profession
www.acm.org
pubs/jet

74 = April 2006/Vol. 49, No. 4 COMMUNICATIONS OF THE ACM 




FROM FINGERPRINT
TO WRITEPRINT

Identifying the key features to help identify and trace online authorship.

By JIEXUN LI,
RONG ZHENG, and
HSINCHUN CHEN

ILLUSTRATION
BY PAUL WILEY


Fingerprint-based identification has been the oldest biometric technique suc- 
cessfully used in conventional crime investigation. The unique, immutable 
patterns of a fingerprint—the pattern of ridges and furrows as well as the 
minutiae points—can help a crime investigator infer the identities of suspects. 


However, circumstances have changed 
since the emergence and rapid proliferation 
of cybercrime. Generally, cybercrime 
includes Internet fraud, computer hack- 
ing/network intrusion, cyber piracy, 
spreading of malicious code, and so on. 
Cyber criminals post online messages over 
various Web-based channels to distribute 
illegal materials, including pirated software, 
child pornography, and stolen property. 
Moreover, international criminals and ter- 
rorist organizations such as Osama bin 
Laden and Al Qaeda use online messages as 
one of their major communication media. 
Since people are not usually required to 
provide their real identity in cyberspace, the 
anonymity makes identity tracing a critical
problem in cybercrime investigation. This
problem is further complicated by the sheer
number of cyber users and activities.
Unlike conventional crimes, there are no 
fingerprints to be found in cybercrime. For- 
tunately, there is another type of print, 
which we call “writeprint,” hidden in peo- 
ple’s writings. Similar to a fingerprint, a 
writeprint is composed of multiple features, 
such as vocabulary richness, length of sen- 
tence, use of function words, layout of 
paragraphs, and keywords. These 
writeprint features can represent an author's 
writing style, which is usually consistent 
across his or her writings, and further 
become the basis of authorship analysis. In this article,
we introduce a method of identifying the key
writeprint features for authors of online messages to 
facilitate identity tracing in cybercrime investigation. 


AUTHORSHIP ANALYSIS AND WRITEPRINT FEATURES 
Authorship analysis is a process of categorizing arti- 
cles by authors’ writing style and is often viewed in 
the context of stylometric research [7]. It has its 
most extensive applications to historical literature 
[4, 8]. Some recent studies introduced this 
approach to online messages and showed promising 
results [3]. This research field can be categorized 
into authorship identification, authorship charac- 
terization, and similarity detection. Authorship 
characterization is aimed at inferring an author’s 
background characteristics rather than identity. 
Similarity detection compares multiple pieces of 
writing without identifying the author. Authorship 
identification determines the likelihood of a partic- 
ular author having written a piece of work by exam- 
ining other works produced by that author. In this 
study, we are particularly interested in authorship 
identification because it is the most relevant to 
cybercrime investigation. 

The essence of authorship identification is to iden- 
tify a set of features that remain relatively constant 
among a number of writings by a particular author. 
Given n predefined features, each piece of writing can
be represented by an n-dimensional feature vector. Supervised
learning techniques such as ID3, Neural Network 
(NN), and Support Vector Machine (SVM) can train 
and generate a classifier so as to determine the cate- 
gory of a new vector, that is, the authorship of an 
anonymous writing. In such a process, the classifica- 
tion technique is very important to the performance 
of authorship identification. Support Vector Machine 
has been frequently used in previous authorship iden- 
tification studies [3, 12]. We have shown in our pre- 
vious work [12] that SVM outperforms other 
classification methods such as decision tree and neural 


network for authorship identification of online mes- 
sages. In addition, the predefined feature set is 
another crucial factor. The writeprint features pro- 
posed in previous literature can be divided into four 
types, as described here. 

Lexical features, the earliest features used in 
authorship analysis, represent an author's lexicon- 
related writing styles. Most of them are character- 
based and word-based features. For example, Elliot 
and Valenza conducted modal testing based on word 
usage to compare the poems of Shakespeare with 
those of Edward de Vere, the leading candidate as 
the true author of the works credited to Shakespeare 
[4]. In Yule’s early work some more generic features 
were employed, such as sentence length and vocab- 
ulary richness [11]. Later, Burrows developed a set 
of more than 50 high-frequency words, which were 
tested on the Federalist Papers [2]. Holmes analyzed 
the use of “shorter” words (two- or three-letter 
words) and “vowel words” (words beginning with a 
vowel) [5]. 
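As a rough illustration of how lexical measures of this kind might be computed, the following Python sketch derives a handful of them from a single message. The tokenization rules, the function name lexical_features, and the particular selection of features are assumptions made for this example, not the feature set used in the studies cited above. Yule's K follows its standard definition.

```python
import re
from collections import Counter


def lexical_features(text):
    """Return a few word-based lexical features for one message."""
    # Crude tokenization (an assumption): runs of letters/apostrophes are
    # words, and '.', '!' or '?' end a sentence.
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n = len(words)
    if n == 0:
        return {"avg_sentence_length": 0.0, "type_token_ratio": 0.0,
                "yules_k": 0.0, "short_word_ratio": 0.0,
                "vowel_word_ratio": 0.0}
    freq = Counter(words)
    # Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2, where V_i is the number
    # of distinct words occurring exactly i times and N is the token count.
    spectrum = Counter(freq.values())
    yules_k = 1e4 * (sum(i * i * v for i, v in spectrum.items()) - n) / (n * n)
    return {
        "avg_sentence_length": n / len(sentences),
        "type_token_ratio": len(freq) / n,
        "yules_k": yules_k,
        # "Shorter" words: two or three letters long.
        "short_word_ratio": sum(1 for w in words if len(w) in (2, 3)) / n,
        # "Vowel words": words beginning with a vowel.
        "vowel_word_ratio": sum(1 for w in words if w[0] in "aeiou") / n,
    }
```

Each returned value would become one component of the feature vector for that message; a real system would need a more careful tokenizer than the one assumed here.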

Syntactic features, including punctuation and func- 
tion words, can capture an author’s writing style at the 
sentence level. They are often “content-free” features 
derived from people’s personal habits of organizing 
sentences. In the seminal work conducted by 
Mosteller and Wallace [8], they first used the fre- 
quency of occurrence of 30 function words (such as 
“while” and “upon”) to clarify the disputed work, the 
Federalist Papers. Subsequently different function 
words were examined and showed good discriminat- 
ing capability [5]. Baayen et al. concluded that incor- 
porating punctuation frequency as a feature can 
improve the performance of authorship identification 
[1]. Stamatatos et al. introduced more complex syn- 
tactic features such as passive count and part-of- 
speech tags [10]. These studies demonstrated that 
syntactic features might be more reliable than lexical 
features in authorship identification. 
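The "content-free" syntactic measures described above can be sketched in the same spirit. The short function-word list below is purely illustrative: only "while" and "upon" come from the text, the other entries and the set of punctuation marks are assumptions for the example.

```python
import re

# Illustrative subset only; "while" and "upon" are mentioned in the text,
# the remaining words are assumed for this example.
FUNCTION_WORDS = ("while", "upon", "the", "of", "and", "to")
PUNCTUATION = ",.;:!?"


def syntactic_features(text):
    """Relative frequencies of function words and punctuation marks."""
    words = re.findall(r"[a-z']+", text.lower())
    n_words = max(len(words), 1)   # avoid division by zero on empty input
    n_chars = max(len(text), 1)
    features = {}
    for w in FUNCTION_WORDS:
        # Function-word frequency, normalized by the number of words.
        features["fw_" + w] = words.count(w) / n_words
    for mark in PUNCTUATION:
        # Punctuation frequency, normalized by the number of characters.
        features["punct_" + mark] = text.count(mark) / n_chars
    return features
```

Because these counts do not depend on topic, they can be compared across messages on different subjects, which is what makes function words attractive for attribution.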

Structural features, in general, represent the author's
habits when organizing a piece of writing. Habits
such as paragraph length, use of indentation, and use
of signature can be strong authorial evidence of personal
writing style. Structural layout traits and other
features have been introduced by de Vel et al. for email
author identification and achieved high identification
performance [3].

Content-specific features refer to keywords in a specific
topic. Although seldom used in previous studies,
these features could complement “content-free” features
to improve the performance of authorship identification
for particular applications. In the
cybercrime context, a cyber criminal often posts illegal
messages involving a relatively small range of topics,
such as pirated software or child pornography.
Hence, special words or phrases closely related to [...]

[...] languages may share similar writeprint features, such
as structural features. However, due to the uniqueness of
each language, some features are not generic. For example, while
most Western languages have boundaries between
words, most Oriental languages do not. In addition,
different languages can have different function words
and word-based features. Rudman summarized
almost 1,000 writeprint features for English used in
authorship analysis applications [9]. Based on previous
literature, Zheng et al. created a taxonomy of
writeprint features for online messages, including 270
features for English (87 lexical features, 158 syntactic
features, 14 structural features, and 11 content-specific
features) and 114 for Chinese (16 lexical features,
77 syntactic features, 11 structural features, and
10 content-specific features) [12].

A number of studies have shown the discriminating
power of different types of features. Furthermore,
researchers attempt to identify an optimal set of features
for authorship identification. Most previous
studies of feature choice compare different types of
features. Even if a type of feature is effective for
authorship identification, some features in this type
may be irrelevant or redundant, hence reducing the
prediction accuracy. For instance, de Vel et al.
observed a reduction in performance when the number
of function word features was increased from 122
to 320 [3]. Feature selection should be undertaken to
remove features that do not contribute to prediction
[3, 5]. To the best of our knowledge, however, few studies
have been conducted to select key features for authorship
identification at the individual feature level. In
addition, extracting a large number of features from
online messages is time consuming and may induce
errors. Therefore, it is important to identify the key
writeprint features for authorship identification of
online messages. Due to the multilingual characteristic
of online messages, in this article we study [...]

Feature selection techniques aim to select a subset
of features relevant to
the target concept: writeprint in this study. There
are a variety of well-developed methods in the pattern
recognition and data mining domains to identify
important features. Liu and Motoda
summarized past studies of feature selection into a
general framework [6]. The process of feature selection
can be viewed as a search problem in feature
space. Exhaustive search and heuristic search are
two major search strategies. Exhaustive search tries
every feature combination to find the optimum
but is computationally infeasible for large feature
sets. Heuristic search uses certain rules to guide the
direction of the search. This search strategy reduces
the size of the search space and therefore speeds up
the process significantly. The major heuristic search
algorithms include hill climbing, best-first search,
and genetic algorithm (GA). GA mimics
the processes of evolution in nature.
The optimal solution, that is, the chromosome with
the highest fitness value, can be achieved via a number
of generations by applying genetic operators
such as selection, crossover, and mutation. GA can
avoid local optima and provide multi-criteria optimization
functions.

Figure 1. The process of GA-based feature selection.
A GA-BASED FEATURE SELECTION MODEL
Here, we propose a GA-based feature selection model to identify
writeprint features. In such a model
each chromosome represents a feature subset, where its length is the
total number of candidate features
and each bit indicates whether a
feature is selected or not. Specifically,
1 represents a selected feature
while 0 represents a discarded one.
For example, a chromosome representing
five candidate features, [...]
fourth, and fifth features are
selected, while the other two are
discarded. In the first generation
each bit of every chromosome is
set to 0, meaning that none
of the features is selected. Each
chromosome, that is, a feature subset,
can be employed to train a classifier.
Thus, the fitness value of
each chromosome is defined as the
accuracy of the corresponding classifier. By applying
genetic operators in the successive generations, the
GA model can generate different combinations of
features to achieve the highest fitness value. Therefore,
the feature subset corresponding to the highest
accuracy of classification along all the generations
is regarded as the optimum. The selected features
in this subset are the key writeprint features
to discriminate the writing styles of different
authors. The process of this GA-based feature
selection is shown in Figure 1.


EXPERIMENTAL STUDIES

Data Collection. To test the feasibility of the authorship
identification and to identify the key writeprint
features for online messages, two testbeds of online
messages (English and Chinese) were created. Since
illegal online messages are of particular interest in this
study, we collected messages that were involved in
selling pirated software/CDs from misc.forsale.computers.*
(including 27 sub-groups) in Google newsgroups
to create the English testbed. The Chinese
testbed is composed of online messages from the two
most popular Chinese Bulletin Board Systems
(smth.org and mitbbs.net), involving seven different
topics (such as movies, music, and novels). For each
of the testbeds, we identified 10 of the most active
authors with 30-40 messages collected for each of
them. The average length of the messages written by
each author is 169 words for the English testbed and
807 characters for the Chinese testbed. Based on our
previous study [12], in total, 270 and 114 features are
predefined and extracted from the English and Chinese
messages, respectively. For online messages with
such short length, when the full set of features is
used, a sample size of approximately 30
messages per author is necessary to predict
authorship with an accuracy of 80% to 90% [12].

Figure 2. Experiment results of feature selection on English and
Chinese testbeds: (a) English; (b) Chinese. [Axis label: Number of
Features]

Table 1. Comparison between full feature set and optimal
feature subset.

Testbed   Feature set      No. of features   Accuracy   p-value
English   Full set         270               97.85%     0.0417
English   Optimal subset   134               99.01%
Chinese   Full set         114               92.42%     0.1270
Chinese   Optimal subset   56                93.56%

Experimental Results.
We applied the GA-based feature selection
model on both the English and Chinese testbeds.
Due to its good performance, the Support Vector
Machine algorithm was selected to generate the classification
model. The feature selection process started
with an empty feature set: no feature was used. In the
successive generations, the GA model conducted a
global search for the optimal feature subset by applying
the crossover and mutation operators. We
observed a significant increase of fitness value (classification
accuracy) in a number of early generations
and a relatively constant accuracy afterward. Meanwhile,
the number of selected features in the best
chromosome of each generation increased from 0 to
about half of the full feature set. Figure 2 shows the
change of accuracy and the number
of selected features along 500
generations for the two testbeds.
The evolutionary process of GA
converged after approximately 50
generations for the English
dataset and 120 generations for
the Chinese dataset. Among all
the chromosomes in the 500 generations,
the one with the highest
accuracy corresponded to the
optimal feature subset.
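The selection procedure just described can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the article names only selection, crossover, and mutation, so the tournament selection, one-point crossover, bit-flip mutation, and all parameter values here are illustrative choices. In the study the fitness of a chromosome is the accuracy of an SVM classifier trained on the selected features; the toy_fitness function and the RELEVANT set below are hypothetical stand-ins so that the example stays self-contained.

```python
import random


def ga_select(n_features, fitness, generations=60, pop_size=20,
              crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Return (best_chromosome, best_fitness) seen over all generations."""
    rng = random.Random(seed)
    # First generation: every bit is 0, i.e. no feature is selected.
    population = [[0] * n_features for _ in range(pop_size)]
    best, best_fit = population[0][:], fitness(population[0])

    def tournament(scored):
        # Pick two chromosomes at random and keep the fitter one.
        a, b = rng.sample(scored, 2)
        return a[1] if a[0] >= b[0] else b[1]

    for _ in range(generations):
        scored = [(fitness(c), c) for c in population]
        for fit, chrom in scored:
            if fit > best_fit:
                best, best_fit = chrom[:], fit
        next_pop = []
        while len(next_pop) < pop_size:
            p1, p2 = tournament(scored)[:], tournament(scored)[:]
            if rng.random() < crossover_rate:        # one-point crossover
                cut = rng.randrange(1, n_features)
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (p1, p2):
                for i in range(n_features):          # bit-flip mutation
                    if rng.random() < mutation_rate:
                        child[i] ^= 1
                next_pop.append(child)
        population = next_pop[:pop_size]

    # Score the final generation as well.
    for chrom in population:
        fit = fitness(chrom)
        if fit > best_fit:
            best, best_fit = chrom[:], fit
    return best, best_fit


# Hypothetical stand-in for classifier accuracy: reward features in an
# assumed "useful" set and penalize every other selected feature.
RELEVANT = {0, 3, 4}


def toy_fitness(chromosome):
    hits = sum(chromosome[i] for i in RELEVANT)
    noise = sum(chromosome) - hits
    return hits - 0.5 * noise
```

A call such as `ga_select(8, toy_fitness)` returns the best bit string encountered and its fitness. Replacing `toy_fitness` with a function that trains and cross-validates a classifier on the selected columns would bring the sketch closer to the procedure in the text, at a much higher cost per evaluation.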

Table 2. Illustration of key writeprint features.

To compare the discriminating
power of the full feature set and
the optimal set, 30-fold pairwise t-tests were conducted
for the English and Chinese
datasets. As shown in Table 1, the GA-based model
identified a feature subset with only about
half of the full set as the key features: 134 out of 270
for English; 56 out of 114 for Chinese. For the English
dataset, the optimal feature set achieved a classification
accuracy of 99.01%, which is significantly
higher than 97.85% achieved by the full set (p-value
= 0.0417). For the Chinese
dataset, the optimal feature set
achieved a classification accuracy
of 93.56%, which is higher than
92.42% achieved by the full set
but not significantly (p-value =
0.1270). In general, using the
optimal feature subset, we can
achieve a comparable (if not
higher) accuracy of authorship
identification.
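For readers who want to see what such a comparison involves, the test statistic of a paired t-test can be computed as follows. This sketch assumes the comparison is made on matched per-fold accuracies of the two feature sets; the function name is hypothetical, and the article does not report the per-fold values.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """t statistic for paired samples a and b (equal length, n >= 2)."""
    if len(a) != len(b):
        raise ValueError("samples must be paired")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("differences have zero variance")
    # Mean difference divided by its standard error.
    return mean(diffs) / (sd / math.sqrt(n))
```

With 30 paired measurements the statistic has 29 degrees of freedom, for which the two-sided 5% critical value is about 2.045; a statistic larger than that in absolute value corresponds to a p-value below 0.05.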

The effect of feature selection is
significant and promising. Furthermore,
we discovered that the
selected key feature subset
included all four types of features.
This is consistent with our previous study [12],
which showed that each type of feature contributes to
the predictive power of the classification model. In
particular, the relatively high proportion of selected
structural and content-specific features suggests their
useful discriminating power for online messages.
Table 2 illustrates several key features identified from
the full feature set.

Figure 3. Comparison of writeprints between three authors.

The results from Table 2 have some interesting
implications. Since some features in the full feature set
may be irrelevant for online messages, the frequencies
of characters related to online messages (such as “@”
and “$”) were selected instead of those of other common
characters (such as “A” and “E”). In addition, since some
features may only provide redundant information, the ratio
of the total number of uppercase letters to the total number
of characters was identified as a key feature, while the
frequency of lowercase letters was discarded. Similarly,
only one vocabulary richness measure, for example, Yule’s K or
Honore’s R, was selected and others were ignored.
Since online messages are often short in length and
flexible in style, structural layout traits such as the
average length of paragraphs became more useful. In
addition, content-specific features are highly related
to their context. Therefore, features such as “sale” and
“check” were identified as the key content-specific
features for the English dataset based on sales of
pirated software/CDs. In other contexts, different
content-specific features should be identified and
used accordingly.

[Table 2, entries legible in the source: Yule's K; Frequency of [...];
Frequency of [...]; No. of sentences/paragraph]

These selected key features of writeprint can effec- 
tively represent the distinct writing style of each 
author and further assist us in identifying the author- 
ship of new messages. Figure 3 exemplifies a compar- 
ison of writeprints between three authors in the 
English dataset using five of the key features, where 
feature values were normalized to [0, 1]. Clearly, 
Mike’s distinct writeprint from the other two indi- 
cates his unique identity. The high degree of similar- 
ity between the writeprints of Joe and Roy suggests 
these two IDs might be the same person. 
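A comparison of this kind can be sketched as min-max normalization followed by a distance computation. The feature values below are invented for illustration only (the article does not list the numbers behind Figure 3), and Euclidean distance is an assumed similarity measure rather than one named in the text.

```python
import math


def min_max_normalize(profiles):
    """Rescale each feature (column) to [0, 1] across all authors."""
    names = list(profiles)
    columns = list(zip(*(profiles[name] for name in names)))
    scaled_columns = []
    for column in columns:
        lo, hi = min(column), max(column)
        span = hi - lo
        # A constant feature carries no information; map it to 0.
        scaled_columns.append([(v - lo) / span if span else 0.0
                               for v in column])
    return {name: [col[i] for col in scaled_columns]
            for i, name in enumerate(names)}


def distance(u, v):
    """Euclidean distance between two normalized writeprints."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Hypothetical raw values of five key features for three author IDs.
profiles = {
    "Mike": [10, 0.90, 5, 2, 40],
    "Joe":  [30, 0.20, 1, 8, 90],
    "Roy":  [28, 0.25, 1, 7, 85],
}
normalized = min_max_normalize(profiles)
```

With these made-up numbers the distance between Joe and Roy is far smaller than the distance between either of them and Mike, which is the pattern the figure is described as showing. How small a distance must be before two IDs are treated as one author is a separate question that this sketch does not address.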


CONCLUSION 

The absence of fingerprints in cyberspace leads the law
enforcement and intelligence communities to seek
new approaches to trace criminal identity in cybercrime
investigation. To address this problem, we
propose the writeprint to help identify an
author in cyberspace. We developed a GA-based feature
selection model to identify the key features of
writeprint specifically for online messages. Experi- 
mental studies on English and Chinese testbeds of 
online messages demonstrated the power and poten- 
tial of the GA-based feature selection model. The 
identified key features could achieve comparable or 
even higher classification accuracy and effectively 
differentiate the writeprint of different online 
authors. 

Currently, several interesting issues in this research 
domain are still open. Given the key features selected, 
we will continue to rank and cluster them based on 
their functional traits, and further provide a visual 
representation of an author's writeprint. Due to the 
multinational nature of cybercrime, we intend to 
employ this feature selection model to identify the 
key writeprint features in other languages such as Arabic
and Spanish. In addition, we are also interested in
applying the writeprint identification approach to 
other related problems such as plagiarism detection 
and intellectual property checking. 
By Nenad Jukic


MODELING STRATEGIES
AND ALTERNATIVES for
DATA WAREHOUSING PROJECTS


Choosing the appropriate modeling approach is
often the critical factor in the success or failure of a
data warehousing implementation.


Data warehousing has become a standard practice for most
large companies worldwide. The data stored in the data
warehouse can capture many different aspects of the business
process, such as manufacturing, distribution, sales, and
marketing. This data reflects explicitly and implicitly
customer patterns and trends, the effectiveness
of business strategies and resultant practices,
and other characteristics. Such data is of vital
importance to the success of the business
whose state it captures. This is why companies
decide to engage in the relatively expensive
undertaking of creating and maintaining
a data warehouse, where the costs routinely
reach millions of dollars [11].

For a data warehousing project to succeed,
it is essential to choose a suitable data modeling
approach. Organizations considering a
data warehousing project should examine the
real differences and trade-offs between available
methodologies and determine for themselves
which approach is best suited for their
environments. Despite the growing pervasiveness
of data warehouses, there is hardly a consensus
among researchers and practitioners
about the most appropriate data modeling
strategies for data warehousing projects. In
order to help readers recognize what the
choices are and the implications of making a
particular selection, this article provides an
impartial and concise view of the competing
methodologies and the issues that drive the
ongoing debate about them.


DATA WAREHOUSES AND DATA MARTS

A typical organization maintains and utilizes
a number of operational data sources. These
operational data sources include the databases
and other data repositories that are used
to support the organization’s
day-to-day operations. A data
warehouse is created within an
organization as a separate data
store whose primary purpose is
data analysis for the support of
management's decision-making
process [7]. Often, the same fact
can have both operational and
analytical purposes. For example,
data describing that customer X
bought product Y in
store Z can be stored in an operational
data store for business-process
support purposes such as
inventory monitoring or financial transaction record
keeping. That same fact can also be stored in a data
warehouse where, combined with vast numbers of
similar facts accumulated over a time period, it
serves to reveal important trends such as sales patterns
or customer behavior.

There are two main reasons 
that necessitate the creation of a 
data warehouse as a separate ana- 
lytical data store. The first reason 
is that the performance of opera- 
tional queries can be severely 
diminished if they must compete 
for computing resources with 
analytical queries. The second 
reason lies in the fact that, even if 
the performance is not an issue, 
it is often impossible to structure 
a database that can be used 
(queried) in a straightforward 
manner for both operational and 
analytical purposes. Therefore, a 
data warehouse is created as a 
separate data store, designed for accommodating 
analytical queries. A typical data warehouse periodi- 
cally retrieves selected analytically useful data from 
the operational data sources. In so-called active data 
warehouses [2], the retrieval of data from opera- 
tional data sources is continuous. For any data ware- 
house, the infrastructure that facilitates the retrieval 
of data from operational databases into the data 
warehouses is known as ETL, which stands for 
Extraction, Transformation, and Load. 

A data mart is a data store based on the same principles
as a data warehouse, but with a more limited
scope. Whereas a data warehouse combines data
from operational databases across an entire enterprise,
a data mart is usually smaller and focuses on a
particular department or subject. Dimensional modeling
[8] is a principal data mart modeling technique.
It uses two types of tables: facts and dimensions. A
fact table contains one or more measures (usually
numerical) of a subject that is being modeled for
analysis. Dimension tables contain various descriptive
attributes (usually textual) that are related to the
subject depicted by the fact table. The intent of the
dimensional model is to represent relevant questions
whose answers enable appropriate decision making
in a specific business area [4].

Table 1. Sample data for operational databases A and B.

Table 2. Sample data for data mart C.


[Sample data, partially legible in the source. Products: 1x1 Zzz Bag
(Camping), 2x2 Easy Boot (Footwear), 3x3 Cosy Sock (Footwear).
Districts: Chicagoland, Tristate. Customers: Tina, Tony. Credit
ratings: Excellent, Good. Demographic types: Hispanic, Asian.]

The two figures here illustrate an example where 
dimensional modeling is used to design a data mart 
that retrieves data from two operational relational 
databases. Figure 1 shows two separate operational 
databases for a retail business. Table 1 shows the sam- 
ple values of data stored in databases A and B. The 
operational database A stores information about sales 
transactions. In addition to transaction identifier and 
date, each sale transaction records which products 
were sold to which customer and at which store. 
Operational database B stores information about cus- 
tomers’ demographic and credit rating data. 

In order to enable analysis of sales-related data, a 
dimensionally modeled data mart C, shown in Figure 
2, is created. This data mart contains information 
from relational databases A and B. The purpose of 
data mart C is to enable analysis of sales across all 
dimensions that are relevant for the decision-making 
process and are based on existing and available opera- 


tional data. The fact table SALES 
contains a numeric measure (Units 
Sold) of the subject sales and for- 
eign keys pointing to the relevant 
dimensions. Dimension Customer 
integrates the data from the Cus- 
tomer Table in database A and all 
three tables in database B. Dimen- 
sion Store merges data from tables 
Store and District in database A. 
Dimension Product merges data
from tables Product and Category,
also in database A. Dimension
Calendar contains details for
each date that fits the range 
between the date of first and last 
transaction recorded in Sales Transaction Table in 
database A. 
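
The structure just described can be sketched as a star schema. The following is a minimal illustration using Python's built-in sqlite3 module, not the article's own implementation. The article names only the tables (SALES, Customer, Store, Product, Calendar) and the Units Sold measure; every column name below is an assumption made for the example.

```python
import sqlite3

# Column names are illustrative assumptions; the article specifies only the
# fact table, its Units Sold measure, and the four dimensions.
DDL = """
CREATE TABLE customer_dim (
    customer_key   INTEGER PRIMARY KEY,  -- system-generated key
    customer_id    TEXT,                 -- operational key from database A
    customer_name  TEXT,
    credit_rating  TEXT,                 -- sourced from database B
    demographic    TEXT                  -- sourced from database B
);
CREATE TABLE store_dim (
    store_key      INTEGER PRIMARY KEY,
    store_id       TEXT,
    district_name  TEXT                  -- merged in from the District table
);
CREATE TABLE product_dim (
    product_key    INTEGER PRIMARY KEY,
    product_id     TEXT,
    product_name   TEXT,
    category_name  TEXT                  -- merged in from the Category table
);
CREATE TABLE calendar_dim (
    calendar_key   INTEGER PRIMARY KEY,
    full_date      TEXT,
    year           INTEGER,
    month          INTEGER
);
CREATE TABLE sales_fact (
    customer_key   INTEGER REFERENCES customer_dim(customer_key),
    store_key      INTEGER REFERENCES store_dim(store_key),
    product_key    INTEGER REFERENCES product_dim(product_key),
    calendar_key   INTEGER REFERENCES calendar_dim(calendar_key),
    units_sold     INTEGER               -- the numeric measure
);
"""


def create_data_mart(conn):
    """Create the fact table and the four dimension tables."""
    conn.executescript(DDL)


def table_names(conn):
    """Return the names of all tables in the database, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
create_data_mart(conn)
print(table_names(conn))
```

The fact table holds only foreign keys and the measure, while each dimension carries the descriptive attributes used to slice the sales figures.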

The dimensions Customer, Store, and Product are 
not normalized. Since the data in data marts is almost 
never updated, given that most data marts are “read 
and append only” data stores, the main motivation for 
normalization (the elimi- 
nation of the possibility of 
update anomalies) does 
not exist in this case. On 
the other hand, de-normal- 
ized (pre-joined) dimen- 
sions can greatly improve 
the performance and con- 
venience of analysis, by 
reducing the need for join- 
ing tables in queries. 
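
As a rough illustration of what "pre-joined" means, the sketch below merges hypothetical Store and District rows into one denormalized Store dimension. The field names and sample rows are assumptions for the example only.

```python
# Hypothetical operational rows; field names are assumptions for illustration.
stores = [
    {"store_id": "S1", "district_id": "D1"},
    {"store_id": "S2", "district_id": "D1"},
    {"store_id": "S3", "district_id": "D2"},
]
districts = {
    "D1": {"district_name": "Chicagoland"},
    "D2": {"district_name": "Tristate"},
}


def build_store_dimension(stores, districts):
    """Pre-join each store with its district so queries need no extra join."""
    dimension = []
    for store in stores:
        district = districts[store["district_id"]]
        dimension.append({
            "store_id": store["store_id"],
            # The district name is repeated on every store row. That redundancy
            # would invite update anomalies in an operational database, but it
            # is harmless in a read-and-append-only data mart.
            "district_name": district["district_name"],
        })
    return dimension


store_dim = build_store_dimension(stores, districts)
for row in store_dim:
    print(row)
```

The join is paid once, when the dimension is loaded, rather than in every analytical query.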

Figure 1. ER-modeled operational database.

Figure 2. Dimensionally modeled data mart C.

Each dimension has a new key, specially designed for the
dimension itself. As shown in Table 2, the value of the key is
not imported from the operational database. Instead, the value
of the key is a unique system-generated semantic-free
identifier. This feature insulates dimensions from possible
changes in the way operational keys are defined (and possibly
redefined) in operational databases over time. The
system-generated key also has a role in tracking the history of
changes in dimensions' records.
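
One way to picture such a key is sketched below. This is an assumption-laden illustration rather than a prescribed mechanism: the class name, its methods, and the customer identifiers are all hypothetical. It hands out sequential integers that carry no meaning, and gives a changed record a fresh key so that earlier versions stay distinguishable.

```python
import itertools


class SurrogateKeyGenerator:
    """Hands out semantic-free integer keys for one dimension (hypothetical)."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._current = {}  # operational key -> key of the latest version

    def new_version(self, operational_key):
        """Assign a fresh key to a new record or to a changed record."""
        key = next(self._counter)
        self._current[operational_key] = key
        return key

    def lookup(self, operational_key):
        """Return the key of the latest version of a record."""
        return self._current[operational_key]


keys = SurrogateKeyGenerator()
first = keys.new_version("C-001")    # customer loaded for the first time
other = keys.new_version("C-002")    # a different customer
changed = keys.new_version("C-001")  # same customer after an attribute change
print(first, other, changed, keys.lookup("C-001"))
```

Because the generated values never depend on the operational identifier, a change in how database A or B defines its keys would only affect the lookup mapping, not the fact table.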

Once data mart C is modeled using dimensional 
modeling techniques, and then populated with the 
data from A and B, finding answers to questions such 
as “Find the top 10 products sold to customers of 
demographic type ‘His- 
panic’ and credit rating 
‘Good’ during the 
month of August for the 
past three years” is 
achieved in a quick 
fashion by issuing one 
simple query. If the data 
mart was not devel- 
oped, the process of 
finding an answer to 
this question would be 
much more complicated and would involve rummag- 
ing through the operational databases A and B, and 
issuing multiple queries. 
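
The "one simple query" might look roughly like the sketch below, again using Python's sqlite3 module. The table layout, column names, and sample rows are assumptions for illustration; only the question itself comes from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY,
                           demographic TEXT, credit_rating TEXT);
CREATE TABLE product_dim  (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE calendar_dim (calendar_key INTEGER PRIMARY KEY,
                           year INTEGER, month INTEGER);
CREATE TABLE sales_fact   (customer_key INTEGER, product_key INTEGER,
                           calendar_key INTEGER, units_sold INTEGER);
""")

# Hypothetical sample rows, invented for this example.
conn.executemany("INSERT INTO customer_dim VALUES (?, ?, ?)",
                 [(1, "Hispanic", "Good"), (2, "Asian", "Good")])
conn.executemany("INSERT INTO product_dim VALUES (?, ?)",
                 [(1, "Zzz Bag"), (2, "Easy Boot"), (3, "Cosy Sock")])
conn.executemany("INSERT INTO calendar_dim VALUES (?, ?, ?)",
                 [(1, 2004, 8), (2, 2005, 8), (3, 2005, 7)])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", [
    (1, 1, 1, 5),  # Hispanic/Good customer, Zzz Bag, August 2004
    (1, 1, 2, 2),  # Hispanic/Good customer, Zzz Bag, August 2005
    (1, 2, 2, 4),  # Hispanic/Good customer, Easy Boot, August 2005
    (1, 3, 3, 9),  # July sale: excluded by the month filter
    (2, 2, 2, 8),  # Asian/Good customer: excluded by the demographic filter
])


def top_products(conn, first_year, last_year, limit=10):
    """Top products by units sold to Hispanic/Good customers in August."""
    query = """
        SELECT p.product_name, SUM(f.units_sold) AS total_units
        FROM sales_fact AS f
        JOIN customer_dim AS c ON c.customer_key = f.customer_key
        JOIN product_dim  AS p ON p.product_key  = f.product_key
        JOIN calendar_dim AS d ON d.calendar_key = f.calendar_key
        WHERE c.demographic = 'Hispanic'
          AND c.credit_rating = 'Good'
          AND d.month = 8
          AND d.year BETWEEN ? AND ?
        GROUP BY p.product_name
        ORDER BY total_units DESC, p.product_name
        LIMIT ?
    """
    return conn.execute(query, (first_year, last_year, limit)).fetchall()


result = top_products(conn, 2003, 2005)
print(result)
```

Every filter in the question maps to an attribute of a single dimension, so the query needs only one join per dimension and one aggregation over the fact table.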

When a dimensionally modeled data mart is in place, performing
data analysis is fairly straightforward. It usually involves
using online analytical processing (OLAP) [3] tools, which
allow users to query facts and dimensions by using simple
point-and-click interfaces.


DATA WAREHOUSE MODELING OPTIONS
Contemporary methodologies offer several data modeling options
for designing a data warehouse, which are described and
compared here with regard to how they approach and utilize the
data mart modeling process.

One option for modeling data warehouses, first proposed by
Inmon [7], envisions a data warehouse as an integrated database
modeled by using the traditional database modeling technique
(ER modeling). After such a data warehouse is created, it then
serves as a source of data for dimensionally modeled data marts
and for any other (non-dimensional) analytically useful data
sets. Figure 3a illustrates this option.

Figure 3a. ER-modeled data warehouse.
Figure 3b. Dimensionally modeled data warehouse.
Figure 3c. Independent data marts.

The idea behind this method is to have a physically stored
central data warehouse modeled as an Entity-Relationship (ER)
model. All integration of the underlying operational data
sources occurs within a central data warehouse ER
model. As mentioned previously, the process of inte- 
grating and consolidating data from the operational 
databases into the data warehouse is known as the 
ETL process. Developing the infrastructure needed 
for the ETL process is usually the most time- and 
resource-consuming part of the entire data ware- 
housing effort [9]. It is not uncommon for a data 
warehousing project team to spend as much as 
50%-70% of development time on ETL functions 
[10]. However, this entire process is driven by its tar- 
get: the data warehouse model. When a proper data 
warehouse data model is developed, the require- 
ments for the ETL development process are clearly 
defined. The success of the ETL development phase 
is then a matter of accurate and efficient execution. 
Once a data warehouse is completed and populated 
with the data, various analytically useful extracts are 
possible, based on this powerful fully integrated relational
database. One of the primary types of analytical extracts is
indeed a dimensionally modeled data mart, which is then queried
using OLAP tools. The Inmon method envisions the need for
other, non-dimensional, data set extracts as also potentially
useful for analysis and decision support. Such non-dimensional
extracts may include single tables, data sets intended for data
mining, flat files, and so forth. Inmon and his collaborators
view these extracts, the data warehouse itself, as well as
additional analytical data stores, as part of a larger concept
they call a Corporate Information Factory (CIF) [5].

Another method, championed by Kimball [8], views a data
warehouse as a collection of dimensionally modeled data marts.
As Figure 3b illustrates, this approach is analogous to the
previous approach when it comes to the utilization of
operational data sources and the ETL process. The difference is
the modeling technique used for modeling the data warehouse. In
this approach, a set of commonly used dimensions (such
as Calendar) known as conformed dimensions is 
designed first. Fact tables corresponding to the sub- 
jects of analysis are then added. A set of dimensional 
models is created where each fact table is connected 
to multiple dimensions, and some of the dimensions 
are shared by more than one fact table. In addition to 
the originally created set of conformed dimensions, 
additional dimensions are included as needed. The 
result is a data warehouse that is a collection of inter- 
twined dimensionally modeled data marts. 
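
To make the idea of shared dimensions concrete, here is a small sketch that lists, for a hypothetical set of fact tables, which dimensions are referenced by more than one of them. The table names are invented for illustration; in the approach described above the conformed dimensions are designed first, whereas this snippet only inspects an already defined set of models.

```python
# Hypothetical data marts: each fact table lists the dimensions it references.
fact_tables = {
    "sales_fact": {"calendar_dim", "customer_dim", "store_dim", "product_dim"},
    "sick_days_fact": {"calendar_dim", "employee_dim"},
    "inventory_fact": {"calendar_dim", "store_dim", "product_dim"},
}


def shared_dimensions(fact_tables):
    """Map each dimension used by several fact tables to those fact tables."""
    usage = {}
    for fact, dimensions in fact_tables.items():
        for dimension in dimensions:
            usage.setdefault(dimension, set()).add(fact)
    return {
        dimension: sorted(facts)
        for dimension, facts in usage.items()
        if len(facts) > 1
    }


conformed = shared_dimensions(fact_tables)
for dimension in sorted(conformed):
    print(dimension, "->", conformed[dimension])
```

A dimension such as the calendar, shared by every fact table, is what ties the otherwise separate data marts into a single warehouse.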

The trade-off between the two methodologies can 
be described as a trade-off between extensiveness and 
power versus quickness and simplicity. The Inmon 
approach requires the creation of a data warehouse ER 
model as a first step. The result of this process can 
then be used in subsequent steps as a basis for model- 
ing dimensional and non-dimensional extracts. In the 
Kimball approach, dimensionally modeled structures 
are created without creating an underlying ER model 
for them. If dimensional structures are all that an 
organization will ever require to fulfill its data analysis 
needs, then the Kimball approach is a quicker and 
simpler way to create a data warehouse. However, if 
other types of analytical data stores will be needed in 
addition to the dimensional structures, then the 
Inmon approach provides a more powerful method. 
An ER modeled data repository can be used as a basis 
for extracts into data stores structured in a variety of 
ways. On the other hand, the dimensional model is 
developed strictly for end user OLAP-style analysis. 

Corporations and organizations should choose 
between these modeling methodologies based on the 
nature of their current and future analytical needs. 
The analysis of those needs should indicate to them 
which method is more appropriate and suitable for 
their environments. 

Comparing these two approaches has been a topic 
of lively debate in the data warehousing field for the 
past few years [1, 5, 6]. Often, attempts are made to
describe one method as superior to another. While
doing so, certain supposed comparative disadvantages 
are frequently mentioned for one (or the other) of the 
approaches. The most common such comment about 
the Inmon approach is that its data modeling phase 
requires high levels of expertise and a considerable 
upfront time commitment. On the other hand, the 
most common criticism of the Kimball approach is 
that it lacks enterprisewide focus and concentrates pri- 
marily on the individual business units or groups of 
users. However, a closer inspection of these arguments 
reveals they do not withstand impartial scrutiny. 

Looking first at the Inmon approach, it is obvious 
that it requires more time to be spent on modeling, 
but this is due to the fact that the created model is 
usable in a number of different ways. Also, recall that 
the majority of data warehousing projects are serious 
multimillion-dollar endeavors, and that the majority 
of resources must be spent on the ETL process (whose 
success is highly dependent on the data modeling 
phase). Having this in mind, it is difficult to see how 
ensuring that the data modeling aspect is done in a 
non-hurried way using a high level of expertise is a 
shortcoming or inconvenience. 

As mentioned, the common criticism of the Kimball approach is
that it lacks enterprisewide focus. The Kimball approach does
allow the fact tables to represent subjects of interest to
individual business units or groups of users (for example, a
fact table showing employee sick days, used by the HR
department). However, the same approach also allows for the
creation of enterprisewide fact tables used simultaneously by
many (if not a majority) of the departments and units within
the organization (such as a fact table showing complete global
and domestic sales and profits for all products). Therefore,
the assertion that the Kimball approach is incapable of
providing enterprisewide focus is simply not accurate.


The reality is that both approaches offer viable
alternatives for modeling and creating data warehouses.
When choosing a data modeling approach for
a data warehouse, the decision should be based on 
which approach is a better fit, rather than on trying to 
decide which methodology is “better.” An organiza- 
tion may even choose a mixture of these approaches 
(for example, using dimensional modeling for con- 
ceptual design of subject areas and then using this 
conceptual design as a basis for creating the ER model 
of an actual physical data warehouse). 

Even though the majority of the discussions about 
data modeling approaches for data warehouses 
involve the Inmon and Kimball approaches (and their 
variations), there is a third approach that should also 
be acknowledged and discussed. This approach 
involves the creation of independent data marts, as 
illustrated by Figure 3c. 

In this method, standalone data marts are created 
independently of other data marts in the organiza- 
tion. Consequently, multiple ETL systems are created 
and maintained. 

Whereas there is ongoing discourse among data 
warehousing practitioners and researchers on the mer- 
its of the Inmon vs. the Kimball approaches, there is 
a consensus among virtually all members of the data 
warehousing community about the inappropriateness 
of using the independent data marts approach as a 
strategy for designing a data warehouse. There are 
obvious reasons why independent data marts are considered
an inferior strategy. Two major shortcomings are the
unnecessary repetition of the ETL effort and the inability to
support cross-department analysis and communication. In spite
of these obvious disadvantages, a significant number of
corporate data warehousing projects end up being developed as a
collection of independent data marts. The reason for this
seeming paradox lies in the lack of initial enterprisewide
focus where data analysis is concerned. Simply, a number of
departments within an organization take a “go-at-it- 
alone” approach in developing the data marts needed 
for their analytical needs. This is commonly due to 
the “turf” culture in organizations, where individual 
departments put more value on their independence 
than on cross-department collaboration. In those 
cases, the existence of independent data marts is more 
of a symptom of the organization’s structural culture 
than a result of deliberately adopting an inferior data 
warehousing approach. Moreover, in some cases the 
budgeting structure of an organization forces parts of 
the organization to undertake isolated initiatives. 
When contemplating the development of data analy- 
sis systems, departments and other constituent groups 
within organizations are often presented with the 
choice of creating independent data marts or doing
nothing. Given those two choices, independent data
marts certainly represent a better option.


SUMMARY 

The issue of data warehouse modeling choice is 
often misunderstood or simply avoided by accepting 
a perceived benchmark practice by default. How- 
ever, the choice of a proper modeling approach is a 
crucial decision, which often determines whether 
the data warehouse implementation will succeed or 
end up as a costly failure. 

Creating a data warehouse is an effort that requires
participation from across an assortment of an organization's
functional areas and the integration of data from multiple
sources. Consequently, people associated with data warehousing
projects come with different backgrounds and various levels of
technical expertise. The description of central issues
determining a data modeling approach for the data warehouse
given in this article can help all parties involved with the
data warehousing project to understand the available
alternatives and make a contribution toward making an
appropriate choice
for their organizations.

BY Andrea Ordanini

WHAT DRIVES MARKET TRANSACTIONS IN B2B EXCHANGES?

A successful digital marketplace exploits content, governance,
and structure in its business model to help generate
revenue-producing transactions between buyer and seller firms.


The failure of many digital marketplaces in 
recent years reflects how difficult it is for 
managers to exploit the potential of digital 
interconnections between businesses. The 
vast majority of newcomers fail to achieve 
liquidity in their target markets no matter 
how long they stay in business, and few 
exchanges are able to manage a consistent 
flow of activity in business-to-business envi- 
ronments [4]. 

Data from a survey I conducted in 2003 on 
a sample of 45 European Union exchanges 
culled from the emarketservice.com directory 
reveals a strong positive relationship between 
the revenue of a B2B exchange and the share 
of that revenue derived from transaction fees; 
at the same time, a marketplace that counts 
mainly on other revenue sources (such as subscriptions,
advertising, and non-exchange services) faces yet more
difficulty on the path of
growth. In this sense, the choice of revenue 
model directly affects revenue (see the table 
here). 

The revenue model is concerned with value 
appropriation decisions, namely the modes in 
which a business model enables revenue gen- 
eration. But revenue model decisions also 
derive directly from business model features; 
in this case, successful marketplaces reflect cer- 
tain distinguishing characteristics compared to 
other B2B exchanges. In this way, researchers 
focusing on the digital exchange environment 
exercise their creativity by identifying new 
business models [5, 10]. Their thinking is gen- 
erally inductive; business models are designed 
using a set of qualitative variables, and categories are
created from scratch, sometimes taking a successful initiative
as a reference point.

Even from a theoreti- 
cal point of view, there is 
no consensus on the 
meaning of the term 
“business model,” which has different connotations in 
different contexts [8]. Here, I employ it in a more 
deductive sense, as a “set of descriptors” for more-or- 
less successful initiatives. Since the term refers largely to 
the internal rules employed by firms doing business in 
the digital environment [1] or to the architecture these 
firms leverage to organize activities in the value net- 
work [11], I considered an e-business model as a frame- 
work consisting of three layers [2]: 


* Content, or market positioning, of an exchange in terms of
what is being exchanged and, above all, who are the target
customers;

* Governance, or the rules and incentives provided by the
various shareholders; and

* Structure of the market, or the mechanisms and services
enabling it to function.

[Table: Correlating revenue and its sources. Surviving entry:
Revenues -0.33]


Findings from my 2003 survey, which was sponsored 
by Bocconi University, reveal that in each layer, a suc- 
cessful marketplace can be expected to exhibit distin- 
guishing features with potentially important technical 
and managerial implications. 


EMPIRICAL EVIDENCE
The emarketservice.com directory includes more 
than 1,000 exchanges worldwide, from which I 
selected the 45 European operators focusing on 
B2B and with commercial offices in more than one 
EU country, thus skipping the much less active 
domestic exchanges. Of the 45, I chose a sample of 32 
and collected data through a questionnaire I devised 
and emailed to CEOs, as well as by looking at their 
Web sites. I used the customer target as a proxy to rep- 
resent the content dimension of the business model, 
particularly whether an exchange had a specific orien- 
tation emphasizing small and medium enterprises 
(SMEs). 

I investigated the governance dimension through 
the relative percentage of shareholder control, such as
whether established firms were involved as shareholders or
financial shareholders controlled the exchange.

I analyzed structure issues through the presence of 
dynamic or static exchange mechanisms, focusing on 
the presence of auctions in the offering (see Figure 1). 
Figure 2 includes the empirical evidence—the relation- 
ships between an exchange’s ability to perform inter- 
mediate transactions and the choices its management 
might have made in terms of customer targets, shareholder
structure, and the presence of auctions in the offering.


CUSTOMER FOCUS

The choice of target customers is a distinguishing fea- 
ture of the most vital exchanges. In particular, the 
inclination toward emu- 
lating a small-business 
environment is an obstacle to growth, and the survey's
results tend to confirm the effects of business models and
small-business design.

B2B exchanges that do 
not view SMEs as their 
primary target market get 
a larger share of their rev- 
enue (46% as opposed to 
15%) from transaction 
activities, while small-business-focused initiatives
must target other non-exchange revenue sources (such
as advertising and service fees) for which the digital 
dimension is significantly less compelling. 

Early B2B exchanges were largely inclined toward 
the needs of SMEs, feeling that the digital marketplace 
model was the best fit for fragmented markets where 
the buyer-side of the exchange suffers from market 
opacity. But the stronger counterpart—the seller—had 
no incentives for participating in a game where bar- 
gaining power represented the jackpot. Participation in 
a B2B exchange required two notable capabilities only 
rarely present in SME firms: 


* Strong integration between transaction activities and
complementary processes that are costly, time-consuming, and
risky for many small firms; and

* Some IT capabilities and learning capacities that have a
positive influence on B2B exchange participation and for which
small firms traditionally lag behind [6].

Figure 1. B2B exchange business model. What choices are most
likely to produce success? (Content: small firms or large firms
as target. Governance: financial shareholders or established
firms as shareholders. Structure: static aggregation or dynamic
matching (auctions).)




In addition, many small firms usually find it to 
their advantage to establish a network of long-term 
relationships with a few key suppliers based on mutual 
trust and cooperation, rather than try to pressure sup- 
pliers. This behavior is inconsistent with the goal of 
transaction cost reduc- 
tion in arm’s-length rela- 
tionships [5]. 

From the technological side of B2B exchange operations, the
superiority of large firms as a target would involve
reconsidering large standardized software platforms that were
once criticized as inflexible. These platforms positively
affect the process phases of an exchange by reducing time and
resources and are well suited to large firms whose internal
processes leave room for efficiency gains.

Small firms have not been an especially lucrative 
market focus for B2B exchanges, especially for those 
oriented toward providing transaction benefits. On 
the other hand, paying attention to the issues of large 
firms has a positive effect on potential profitability. 


SHAREHOLDER STRUCTURE 
Another characteristic of most dynamic exchanges is 
their governance structure. The results of the survey 
reveal how B2B exchanges with financial shareholders 
receive a lower share (16%) of their revenue from 
transaction activities, while the corresponding share in 
the private/consortia marketplace is 42%. Even so, 
some marketplaces unable to realize digital intermedi- 
ation are forced to identify and pursue other revenue 
sources involving lower potential profit margins to 
sustain their revenue flows. 

Many exchanges in the early wave of B2B development were
newcomers; their start-up capital was provided by venture
capitalists and financial institutions
interested in a mid-term stock price increase in 
exchanges with the potential for quickly transforming 
themselves into publicly held corporations. Unfortu- 
nately for these initial 
investors, the exchanges in 
many cases failed to pro- 
vide or transfer specific 
knowledge and expertise 
regarding the businesses in 
which they would operate. 
Without such knowledge, 
the exchanges had diffi- 
culty making themselves 
attractive to firms looking 
for better ways to do busi- 
ness; they had even more 
difficulty providing ser- 
vices that would improve 
transaction conditions in 
specific contexts. Market- 
places involving a signifi- 
cant percentage of 
financial shareholders are 
more likely to be service providers than true 
exchanges. 

The most direct way for a digital marketplace to 
achieve specific knowledge has been to become an 
exchange sponsored by an established firm with a rel- 
evant position in the market. These operators, called 
private marketplaces or consortia when the sponsor is 
an industry association, are biased in the sense that 
they serve only the buyer-side of the transaction and 
usually involve only one or several big purchasers 
(generally the shareholders). 

User firms investing in these exchanges sought to improve the
efficiency of transactions in specific business environments,
drawing from their day-to-day experience and knowledge. In
addition, partner firms that are also shareholders are
traditionally more patient than their purely financial
counterparts, helping align the time horizon of expectations of
both investors and managers.

Figure 2. Features supporting transaction activity. (Axis:
percentage of revenues from transaction fees; surviving values:
45.3%, 42.2%.)

Even if these private exchanges are viewed as 
e-business support for their shareholders, they have 
progressively opened their activity to other buyers, 
reducing the share of intermediation associated with 
captive businesses. Thus, 
the inclination toward 
neutrality, which sup- 
ported the early B2B 
exchanges, should be 
reconsidered—a situation 
that reveals how a win- 
ning technological solu- 
tion reflects a high level of 
customization and flexi- 
bility by the main users 
and owners of an exchange. The level of customiza- 
tion reflects the integration process of an exchange’s 
software platform and the user firms’ business 
processes, thus requiring adaptation of existing rou- 
tines. In this way, successful platforms (in terms of 
both traffic and revenue) appear to be standardized 
(in terms of process automation and customization); 
that is, they are deeply integrated into the business 
process of user firms. 




DYNAMIC NEGOTIATION AND AUCTIONS 

The third characteristic defining the capability to
intermediate digital transactions is the use of tools
for dynamic matching, particularly auctions.
Dynamic tools allow the simultaneous interaction of 
supply and demand, adjusting negotiation condi- 
tions in real time; static tools permit only transac- 
tions on predefined terms [9]. 

The empirical evidence from my survey indicates 
that the auction provides the only discriminating item 
for the transaction capability among the services an 
exchange can provide. This means that the negotia- 
tion mechanism is the main aggregating factor in 
today’s digital marketplace. In fact, the survey showed 
how B2B exchanges that provide auction services 
retain a greater share of revenue derived from transaction
fees: 42% as opposed to 17% for exchanges
that do not offer such a service.
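
The three comparisons reported so far can be put side by side with a few lines of arithmetic. The sketch below uses only the percentages quoted in the text; the dictionary layout and the group labels are choices made for this illustration.

```python
# Share of revenue from transaction fees (percent), as quoted in the text.
reported_shares = {
    "customer target": {"not SME-focused": 46, "SME-focused": 15},
    "shareholders": {"private/consortia": 42, "financial": 16},
    "auction services": {"offered": 42, "not offered": 17},
}


def gaps(shares):
    """Return the percentage-point gap between the two groups per feature."""
    result = {}
    for feature, groups in shares.items():
        result[feature] = max(groups.values()) - min(groups.values())
    return result


feature_gaps = gaps(reported_shares)
for feature, gap in feature_gaps.items():
    print(f"{feature}: {gap} percentage points")
```

On these figures, each of the three business model features is associated with a gap of at least 25 percentage points in the transaction-fee share.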

In this sense, it seems that only dynamic tools pro- 
vide a real and complete reduction in transaction 
costs for the exchange, due to the fact that a signifi- 
cant portion of these costs appear after the transaction 
has been realized and cannot be captured through sta- 
tic mechanisms. 

There are two types of transaction costs [8]: ex-ante 
for searching counterparts and coordinating exchange 
procedures and ex-post arising from opportunism and
adverse selection. Static mechanisms affect the ex-ante
dimension, since they can help reduce the marginal 
costs for searching for and identifying counterparts 
through various aggregation mechanisms. However, 
they do not affect the ex-post dimension; those effects 
start to appear only during negotiation. 

Dynamic tools for 
negotiation work only 
when ex-post transaction 
costs begin to appear. For 
instance, a reverse auction 
can affect motivation 
costs, shifting the inertia 
of the exchange from 
seller to buyer, rebalanc- 
ing bargaining power, 
since sellers are forced to 
reveal their strategies. In 
this sense, sellers’ poten- 
tially unfair behavior and 
the probability of selecting an ineffective partner are 
reduced. 

This way of countering sellers’ behavior requires 
that the auction enable many participants to join the 
negotiation on the seller-side. It is most likely to hap- 
pen when the auction offers real opportunities for 
sellers to increase their sales through deals otherwise 
impossible even to contemplate. This is precisely 
what happens when the buyer is a large firm, the aver- 
age value of the auction is relatively high, and auc- 
tions are held frequently. 


Figure 3. Strategies for producing successful business model
characteristics. (Labels: large firms as customer target;
established firms as stakeholders; dynamic matching
(auctions).)


IMPLICATIONS 

Successful B2B exchanges are able to profit from 
transactions, focusing on transaction fees as a main 
revenue source (see Figure 3). The business models 
for such exchanges exhibit three distinguishing char- 
acteristics: 


* Large firms as main customers (content);

* Incumbent firms as shareholders (governance); and

* Auctions as the main mechanism for realizing transactions
(structure).


In terms of content, aiming for large firms seems 
the optimal choice for the exchange, since it reduces 
the aggregation effort, amplifying the benefits from 
individual deals. This business strategy suggests a 
rethinking of B2B exchanges as a solution for frag- 
mented markets and as a driver for SME firms seek- 
ing to rebalance their bargaining power toward 
neutrality. In this way, successful exchanges function 
more as tools for standardizing processes and gaining
efficiencies for large organizations than as mechanisms
for guaranteeing more flexibility to (already
flexible) SME firms. 

Regarding governance decisions, exchanges with an 
industrial foundation, where manufacturing firms are 
relevant shareholders, emerge as more buoyant. The 
success of private or consortia exchanges sheds new 
light on the role of neutrality in digital markets and 
suggests that financial shareholders may play a signif- 
icant role only in the earliest stages of an exchange’s 
development. 

In addition, captive marketplaces are more likely to 
exhibit deeper integration of technological solutions 
and business processes, leading to superior perfor- 
mance, sustained by shareholder/customer sunk costs. 

Lastly, regarding a market's structure, exchanges 
with dynamic tools for matching buyers and sellers 
(such as auctions) are more attractive to participants 
looking for reduced ex-post transaction costs. In this 
sense, the true exploitation of the potential of a digi- 
tal relationship would focus mainly on the negotiation 
attributes of digital platforms. 

What are the implications of this analysis? First, a 
successful marketplace appears to polarize and rein- 
force previous differences in the market, rather than 
function as an instrument for rebalancing the rela- 
tionship between buyers and sellers. If the dominant 
model, at least among the exchanges in my 2003 sur- 
vey, is that of a private marketplace leveraged to 
improve the purchasing decisions of large firms 
through reverse auctions, the diffusion of B2B 
exchanges works to support already strong operators 
from the buyer-side, providing them cost efficiencies. 
These exchanges do not rebalance bargaining power 
within existing vertical relationships but strengthen 
the transaction capabilities of their already strong 
counterpart buyers and sellers, giving them new busi- 
ness opportunities. 

Also notable is that a successful business model 
does not consist of independent decisions but of a 
series of complementary choices along all relevant 
dimensions, where the choice in any one dimension 
reinforces the other dimensions. The emerging pic- 
ture, based on my survey, shows how the success of 
any B2B initiative is a blend of market positioning, 
governance structure, and services offered. These 
three features should, from a participant’s perspective, 
also be used to address the decision of whether to par- 
ticipate in a B2B exchange and how to choose among 
various potential operators. 

From the exchange point of view, successful soft- 
ware platforms should combine three items: 

* Standardization features to guarantee efficiency gains;

* Business process integration to address users'
firm-specific efficiency problems; and

* Negotiation tools to allow the active participation
of counterpart buyers and sellers before, during,
and after transactions.


To succeed, Internet-based initiatives in the B2B 
environment must be linked to incumbents’ needs; 
only exchanges that reflect the business strategies of 
established firms may concretely exploit the opportu- 
nities of digital transactions, even if these opportuni- 
ties do not represent an exchange’s exclusive business 
arena. This conclusion is consistent with the vision of 
B2B exchanges as architectural innovators, bringing 
new ways to realize inter-firm transactions in the marketplace.
They must therefore focus on the profitability
of participating firms.

DOES COLOR IN EMAIL MAKE A DIFFERENCE?

Yes, if used correctly, it can excite and please,
prompting recipients to respond as the sender intended:
clicking a designated link or even buying something.


FROM INAUSPICIOUS BEGINNINGS AS A MEANS OF 
communication in the earliest days of computer 
networks some 40 years ago, email has become a killer 
application helping drive the Internet's momentum. 
Surging in tandem, they have given rise to a plethora of 
information in all areas of enterprise operations while 
vying for the attention of end users. In terms of 
marketing strategy, these developments mean the time
has come for emotionally evocative email.


ILLUSTRATION BY SERGE BLOCH 

The email marketing industry was expected to
grow from $164 million worldwide in 1999 to $7.3
billion in 2005; meanwhile, the 200 billion marketing
email messages in 2005 were expected to surpass
the amount of marketing material sent by traditional
postal mail in 2002. This means a huge potential business
impact from email advertising and marketing but also a
major lack of attention, as most recipients ignore many of
their email messages. The problem of getting people's
attention and making sure they open and read their email
can be overcome by making the messages personalized
and emotionally evocative [8]. In this context,
color can help increase response rates by grabbing
attention and evoking positive feelings.

As email messages become more sophisticated in
both organization and format, color is likely to play
an increasingly important role in their presentation.
Unlike color in print, color in digital documents
is inexpensive and increasingly accessible on popular
platforms (such as on handheld devices like
PDAs and cell phones). At the same time, however,
little is known about how to use color in email messages.

Here, we examine the impact of background color 
on recipients’ responses to email messages that try to 
tempt them to travel to a designated link providing 
more detailed information. We explore the common 
scenario of people receiving a list of incoming email 
messages (their inbox) and the decisions they make as to 
how to react to a particular message: ignore it, read its
displayed information (such as sender and title), defer
treatment, delete, or just click and open. Our concern
is with the process, given that a message has been
opened and the recipient must decide whether to act
in a way that’s been suggested in the email. When the
message is open, the format of the email message, including
background color, is evident to the recipient, and
the recipient is asked to click on a link within its text. Our
research question concerns the effect of color on a recipient’s
decision to travel or not to travel to the recommended link.

Table 1. Conflicting results of prior studies on the relative effects of color.

Table 2. Colors used as background in the message.


COMPONENTS OF COLOR 

To discuss the effect of color, we must examine its 
three dimensions: hue, chroma, and value. Hue cor- 
responds to the normal meaning of color—its pig- 
ment (such as red, green, blue). Chroma (saturation) 
is the relative amount of pure light that must be 
mixed with white light 
to produce the per- 
ceived color; colors low 
in chroma appear dull 
in comparison to the 
richness and depth of 
high chroma. Value 
(brightness) is the level 
of lightness of the color; colors high in value appear 
“whitish,” while colors low in value appear “dark- 
ish.” 
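
The three dimensions can be made concrete with a small sketch. It is only an illustration under stated assumptions: it uses the HSV model from Python's standard `colorsys` module, treating HSV saturation as a rough stand-in for chroma and HSV value as a stand-in for brightness, and the RGB triples are made-up samples, not the colors used in the study.

```python
import colorsys

def describe(name, r, g, b):
    """Return (name, hue in degrees, saturation, value) for an 8-bit RGB color."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    return name, round(h * 360), round(s, 2), round(v, 2)

# Hypothetical sample colors (assumed for illustration, not from the article).
samples = {
    "pure red": (255, 0, 0),   # high saturation, high value
    "dull red": (128, 96, 96), # same hue, low saturation
    "dark red": (96, 0, 0),    # same hue, low value
}

for name, rgb in samples.items():
    print(describe(name, *rgb))
```

All three samples share one hue; only the saturation and value figures change, which is the sense in which the dimensions are independent.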

Color influences our emotions. It has been 75 
years since the claim [11] that yellow and pink are 
superior background colors for questionnaires, most 
likely to evoke a higher response rate and willingness 
by survey participants to invest time and cognitive 
effort. In marketing, too, color has been tied to con- 
sumer reactions to advertising in research defining 
consumer behavior as a function of the value and
chroma of the color being used [4] and establishing a
link between color and consumer behavior through
two dimensions: boredom/excitement and relaxation/tension.

It should be noted in this context that one can be 
both excited and relaxed at the same time, as the two 
dimensions are defined separately. On the one hand, 
we know that people are more inclined to act as sug- 
gested in an ad when they are excited rather than 
bored and when they are relaxed rather than tense. On
the other hand, colors high in value (more whitish
colors, like pastels) make one feel more relaxed, and 
colors high in chroma (brighter colors) make one feel 
more excited. We can thus use color of high value and 
high chroma to create a message context in which the 
recipient is more inclined to respond positively to the 
message. 

Research into the link among color, information, 
and computers is not new but has concentrated on 
cognitive rather than affective aspects of human 
behavior. Color is commonly available and easily 
applied in information and communication technolo- 
gies to broaden their scope from mere information 
processing to communication and presentation applications.
For example, color in business communications
has been shown to enhance reader
comprehension by presenting different categories in
different colors [12].

Almost 20 years ago, a study of color and information
processing [7] concluded, “Color is a subtle variable
that can significantly enhance a decision maker's
ability to extract information.” Going a step further,
beyond facilitating information extraction and manipulation,
color today is intentionally used to influence
emotions. Indeed, one of the potentially most impor- 
tant areas of using color in email is in motivating con- 
sumers to take time to read more detailed information 
about or to outright purchase a product. 

How are consumers motivated to access more
information? Our working assumption is that they are
already overloaded with information, precisely the
feeling many of us have when surfing the Web. We need to be
tempted to invest more time in reading online material; without
such temptation we might skip to the next item, ignoring the
message or deleting it without opening it. In the Internet context, an
email designer must grab consumers’ attention, perhaps by
being emotionally evocative.
Background color is an age-old 
technique for manipulating emo- 
tions. Folklore based on anecdotal 
research tells us that restaurant 
walls painted in color (such as 
red) boost one’s appetite. Is the 
same true for online commodi- 
ties? Although not conclusive, the 
evidence on the impact of color 
on consumers’ willingness to act 
seems to indicate that, if color
does have an impact, the most
effective colors excite us without 
making us tense. As noted earlier, 
the colors (such as pastels) most 
likely to do this produce warm yet relaxed feelings. 
Indeed, representative work (see Table 1) provides evi- 
dence to support this conclusion, though it also pro- 
vides evidence to the contrary. Part of the problem of 
sorting these contradictory conclusions may be the 
relatively small sample size of studies on response 
rates, which themselves tend to be low relative to the 
sample size. 


Figure 1. Distribution 
of the customer base 
by country. 


FIELD EXPERIMENT 

Our study examined the use of background colors in 
email messages to motivate recipients to access a 
more detailed description of a product via a hyper- 
link. The field experiment tested the effect of back- 
ground color on a short email message distributed to 
a sample of 1.4 million consumers, derived from the 
customer base of an email host service company 
called IncrediMail (www.incredimail.com). IncrediMail
provides mailing capabilities designed to enrich communication;
for example, users can choose preferences in animation,
voice, background, and e-card text and design and integrate
them into their messages.

The IncrediMail customer 
base consisted of more than four 
million subscribers in more than 
70 countries, of which the major 
sources were the U.S. (28%), 
Canada (9%), and Western 
Europe (30%) (see Figure 1). 
The age distribution included 
children up to age 19 (11%); 
20-29 (27%); 30-39 (26%); 
40-49 (19%); 50-59 (11%); and 
60+ (5%); 34% of the subscribers 
were women. The overall population is similar to the 
general patterns of Internet user populations world- 
wide, as reported in general user surveys [5]. The 
results of our field experiment can therefore be viewed 
as highly generalizable. 

In a pre-test we conducted to 
explore distribution and moni- 
toring mechanisms, we distrib- 
uted a message to 50,000 
IncrediMail subscribers and 
monitored them for seven days. 
In that message the distributor 
wished the recipients a “Merry 
Christmas,” inviting them to 
view an e-card collection. This 
paved the way for the main 
experiment, which was included 
in an issue of the monthly 
newsletter (with news items, 
updates, and links) distributed to 
all IncrediMail subscribers. We 
sent the email message in Figure 2 to the 1.4 million
subscribers a day after Christmas, then monitored
their response for 14 days, up to the beginning of the second
week of January. This personalized greeting wished them
a “Happy New Year” and
invited them to download an attached movie. Each
subscriber received the same message with a different
color background.

Figure 3. Distribution of responses. Colors indicate the
background color of the email.

Figure 2. The message sent in a variety of background colors.

We randomly divided the sample into five groups of
280,000 subscribers each. Each group received the
same message with a different background color: white,
green, pink, yellow, or blue. Table 2 lists the definition
of the colors we used in terms of hue, chroma, and
value. Note they are all relatively high in terms of both
value and chroma and are indeed the pastels, soft colors
that evoke calm and relaxed feelings. Note, too, that
yellow and pink, the traditional favorites, are included.
Our strategy was to identify which colors are superior
to white in the context of email marketing.
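
The random split described above can be sketched as follows. This is a minimal illustration rather than the procedure the study actually used: the function name `assign_groups`, the fixed seed, and the use of integer subscriber IDs are all assumptions made for the example.

```python
import random

COLORS = ["white", "green", "pink", "yellow", "blue"]

def assign_groups(subscriber_ids, colors=COLORS, seed=0):
    """Shuffle subscribers and deal them into equal-sized color groups."""
    ids = list(subscriber_ids)
    random.Random(seed).shuffle(ids)   # seeded only so the sketch is repeatable
    size = len(ids) // len(colors)
    return {color: ids[i * size:(i + 1) * size] for i, color in enumerate(colors)}

# Hypothetical subscriber IDs 0..1,399,999 stand in for the real mailing list.
groups = assign_groups(range(1_400_000))
print({color: len(members) for color, members in groups.items()})
```

Shuffling once and slicing keeps the groups disjoint and equal in size, so any difference in response can be attributed to the background color rather than to who happened to be in each group.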

Earlier statistics from IncrediMail on its own activities
indicate that up to 55% of email recipients open
their mail, with over 50% clicking on the link to read 
a detailed message. We experienced similar rates in 
our experiment. The overall percentage of experimen- 
tal recipients who opened their mail was around 30%. 
The response rate on the first day of distribution was 
around 4% of the entire sample of 1.4 million sub- 
scribers; the next day, the rate peaked at 10%, then 
gradually declined to less than 0.5% on the 12th and 
13th days after distribution. Figure 3 plots the 
response distribution, which is typical of online sur- 
veys. The percentage of experimental recipients who 
clicked on the recommended link varied from color
to color: yellow, 52.80%; green, 52.77%; blue,
51.98%; white, 51.08%; and pink, 50.81%. Yellow 
was first, but white (the basis for comparison) was not 
worst (see Figure 3). 

We used a statistical test (chi-square) to compare
each color with white. Even though the absolute differences
in response rates were small, the differences
were all statistically significant at the 0.001 level. Three
of the four colors increased the response rate,
by up to 1.72 percentage points, compared to no color at all. Assuming
a 5% purchase rate among those who voluntarily traveled
to the detailed description, the increase attributed
to color represents a 0.2% increase in sales, which can
mean a great deal of business through the emailing lists
typically employed in email-based marketing.
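
To show the kind of comparison involved, here is a hedged sketch of a chi-square test for yellow versus white. The article does not report the raw counts, so the denominators are an assumption: the sketch supposes that about 30% of each 280,000-subscriber group opened the message and that the click percentages apply to those openers. The function name `chi_square_2x2` is likewise illustrative.

```python
import math

def chi_square_2x2(clicks_a, n_a, clicks_b, n_b):
    """Pearson chi-square statistic (1 degree of freedom) for two click rates."""
    observed = [
        [clicks_a, n_a - clicks_a],
        [clicks_b, n_b - clicks_b],
    ]
    total = n_a + n_b
    row_totals = [n_a, n_b]
    col_totals = [clicks_a + clicks_b, total - clicks_a - clicks_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# ASSUMPTION: roughly 30% of each 280,000-subscriber group opened the message.
opened = round(0.30 * 280_000)
yellow = round(0.5280 * opened)   # click rate reported for yellow
white = round(0.5108 * opened)    # click rate reported for white

chi2, p = chi_square_2x2(yellow, opened, white, opened)
lift = (yellow - white) / opened * 100   # difference in percentage points
print(f"chi2 = {chi2:.1f}, p = {p:.2e}, lift = {lift:.2f} points")
```

With groups this large, even a difference of under two percentage points yields a large test statistic, which is why a small effect can still be statistically reliable.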


CONCLUSION 

Color is easy to implement, inexpensive, and 
increasingly used in email. But how well does it 
influence a recipient’s responsiveness? Color has two 
main functions—attract attention and set the right 
mood—for responding positively to a message or 
request. And because of our increasingly short atten- 
tion spans and the relatively quick interaction speed 
we expect in today’s electronic world, it must do 
both at the same time. 

Color can be a prime attention grabber when and 
where people's attention is scarce. Communication on 
the Web must therefore be emotionally evocative for 
it to be able to direct attention to the message in ques- 
tion, bearing in mind that color can move people by 
increasing excitement and lowering tension at the 
same time. 

The right color helps generate feelings that have 
been found to induce people to respond favorably to 
requests and marketing communications (such as ques- 
tionnaires and ads). The pastels—the class of colors 
most effective at doing this—are all relatively high in 
value and chroma (see Table 2). Extending previous 
research demonstrating the effectiveness of pastels, 
especially yellow, our study found that color can indeed 
make a difference, albeit a small one, but one that can 
be judged statistically significant due to the large sam- 
ple we used. Moreover, in the age of Internet-based 
mass communication, small yet reliable differences in 
response rates translate into substantial business gains. 

Designers of mass online communication systems, 
including advanced forms of email, Web sites, chats, 
bulletin boards, and pop-up ads, should aim to be 
emotionally evocative. Color is perhaps the first and 
most obvious example of how to make a message 
emotionally evocative, but other elements of design 
(such as type font and size) may be equally important. 
A 2003 study of emotional reactions to Web homepages
found that several design factors, including
graphics, color, and aesthetics, may all be emotionally 
evocative [9]. 

Caution is also called for. Affective qualities of 
human-computer interaction are especially suscepti- 
ble to cultural differences. For example, the color red 
in certain Asian cultures and the color green in Islam 
are usually associated with sacred objects; because 
inappropriate use of these colors is offensive, we delib- 
erately avoided them in our study, but designers must 
be aware of these sensitivities [12]. As we begin to 
understand the factors that help attract attention 
while generating positive feelings, it is also time to 
capitalize on this understanding. 

By Rosario Vidal and Elena Mulet

THINKING ABOUT COMPUTER SYSTEMS TO
SUPPORT DESIGN SYNTHESIS

What should the CAD systems of the future be like?


Designers currently have access to numerous computer systems 
that aid them in the different stages of the design process, such as
analysis-synthesis-evaluation, and throughout the phases a
design goes through: conceptual design, embodiment design, and 
detailed design. But whether it is possible to develop a computer 
system for creative design that can automatically invent new solu- 
tions is still controversial. Some researchers believe the very 
nature of creativity is so unpredictable it cannot be explained in 
detail or envisaged by a computer system, although many believe 
it is feasible for future computer-aided design (CAD) systems to
be expanded in order to support an interactive process with the 
designer [2, 3, 6]. Other researchers defend the potential capacity 
of computers to reach novel solutions through far different mech- 
anisms from those humans use to be creative [1, 5]. 

Figure 1. Functions for design support systems.

In our opinion, many design 
process mysteries must be solved 
before a computer system can be 
developed that aids designers significantly more than 
current CAD systems. Furthermore, as far as we 
know, no autonomous design systems have been 
developed to date that offer the same degree of satis- 
faction as working with a human designer. Experi- 
mental research into how designers design can, 
therefore, provide us with valuable knowledge to help 
in developing new computer-aided design systems. 

We attempted to uncover some of the mysteries of 
the design process by developing the MADIS project. 
The ultimate purpose of this project is to implement 
an architecture capable of supporting a design process 
that interacts with the designer, provides knowledge 
management capabilities, and stimulates creativity— 
while also simplifying collaborative work. Our first 
steps were to carry out experimental research involving
the systematic observation of several groups of
designers to determine how they generate solutions in 
the conceptual design phase, and how their effective- 
ness varies according to the means of expression used 
(words, drawings, or objects). This research work pro- 
vided us with an initial idea of what functions future 
systems ought to include, and what the interrelation 
between user and computer should be like; it also
enabled us to establish a series of guidelines with
which to orient the design process. Although creative
design is unpredictable, following certain guidelines 
can lead to more effective design. 

Based on our review of the literature [3, 10, 12] 
and our research, we believe future design support sys- 
tems must be capable of performing the following 
functions (see Figure 1): 


* Supporting and visualizing the design as it is carried out.
* Stimulating creativity.
* Knowledge management.
* Allowing collaborative work.
* Designing interactively with the user.


CAD systems were originally intended to serve as 
a platform on which to develop designs graphically. 
In recent years, however, these systems have 
advanced from 2D to 3D representation and now 
include different modules such as knowledge-based 
engineering (KBE) and computer-aided engineering 
(CAE), as well as tools that allow designers to work 
in collaboration with others. 

Being able to visualize a design plays a fundamen- 
tal role in the design process. Some studies [11] have 
suggested that in addition to acting as a means of 
communication, 2D and 3D visualizations are in fact 
an intrinsic part of the design process. They help to
expand the designer’s short-term memory and make it
easier to come up with solutions. In addition, 3D visualization
that simulates the actual handling of objects
makes the design process more effective.

Perhaps the most important point is that during 
the conceptual design phase, drawings can still be 
done by hand much faster than with computers. This 
limitation of computers has negative repercussions on 
the cognitive capacity to solve the design problem. To 
improve this situation, progress is being made in the 
definition of new interfaces and devices, such as pen- 
based input devices, expressive sketching techniques, 
and electronic whiteboards. 

Another important capability needed in future 
CAD systems is the reconstruction of freehand 
sketches, which must allow for the application of 
operations such as partial erasure 
and the definition and modification 
of parameters, as if they had been 
introduced using the traditional 
software interface right from the 
outset. 

Much of the theoretical and 
development work in the area of 
pen interfaces capable of tasks such 
as recognizing gestures and inferring 
3D from 2D outline sketches has 
been accomplished by Mark Gross, 
Ellen Do, and others, under projects 
including Digital Clay, Digital
Sandbox, Electronic Cocktail Napkin,
and Gesture Modeling.

The first condition computer
software for design support must 
meet is not being an obstacle to cre- 
ativity. A good atmosphere for cre- 
ative processes does not ensure 
creativity, but being creative does 
become far more difficult if the computer tools being 
used fail to promote a favorable environment because 
they are too complex, too rigid, or too slow. 

A more significant requirement is that the design 
support system must also stimulate creativity. 
Although it is possible to find several computer pro- 
grams aimed at stimulating creativity, they have been 
conceived and projected independently from other 
CAD tools. One of the research issues currently receiv- 
ing much attention in engineering design is the selec- 
tion of creative methods and techniques according to 
the stage of the design process, and the integration of 
both applications in the same system. 

Figure 2. Guidelines for designing with a computer.

The design process also entails the acquisition and
management of large amounts of knowledge. To
obtain, understand, and adapt large amounts of data,
and then manage data flow throughout the design
process is extremely difficult without some kind of 
methodology. KBE is a methodology that allows us to 
organize the data flow and to develop an architecture 
under which design automation can be effectively 
implemented. KBE systems create, or in some way 
manipulate, detailed geometries and employ a method 
of representing knowledge that can be interpreted by 
the computer. 

KBE packages such as ICAD and Intent have been 
applied to design automation in projects involving 
large amounts of knowledge and the handling of 
complex geometries. These turn out to be very 
sophisticated systems and are not directly linked to 
graphics software, which has led to their being basically
(although not exclusively) limited to applications
in the aeronautical or the automotive industries.

The latest generation of CAD systems is beginning 
to incorporate the tools usually found in a KBE sys- 
tem that expand the possibilities of parametric design 
by including “if-then” statements, constraints, checks,
and the use of spreadsheets to collect information.
The possibilities of automation offered by these modules
have aroused some curiosity within SMEs,
although interfaces need some simplifying to make 
them more accessible to traditional CAD software 
users with no specific KBE training. 

Another question related to the integration of the 
two systems, CAD and KBE, is whether CAD sys- 
tems must acquire the functionality of a KBE system 
in order to support various aspects of the product life 
cycle or, conversely, whether CAD systems should be 
built with a completely open architecture to allow 
easy integration of a wide range of KBE software [7]. 

One can suppose that today’s CAD, CAM (computer-aided
machining), and PIM (product information
management) systems will become integrated
product life cycle (PLC) systems, and indeed, a num- 
ber of applications have already been developed that 
go in this direction. Additional PLC elements ought 
to be added to knit together the domains of CAD, 
CAM, CAE, and PIM. These will include, for exam- 
ple, information management systems for require- 
ments and costing, as well as the use of warranties and 
financial information. The resulting architecture will 
allow the creation of an effective collaborative engi- 
neering environment that comprises the whole prod- 
uct management team. Kvan and collaborators [8] 
have noted the importance of using visual and textual 
communication and also virtual environments. Different
capabilities will be included to enable the use of
video conferencing, chat-line communication, electronic
whiteboards, and other collaborative devices.

We believe the best approach would be to develop systems
that interact with the designer, instead of working
independently, because the cognitive processes
that take place during conceptual design are very
unpredictable and, without human involvement, only
the more obvious design solutions are reached.
Computational algorithms would be responsible for 
proposing solutions; the designer would process this 
information offline in order to redefine the problem, 
and then feed it back into the computer and refine the 
design space. It would also be important for the design 
team to take part in the evaluation and the decision 
making concerning the concepts generated by the 
computer as a result of the computational procedures. 


GUIDELINES FOR DESIGN SUPPORT SYSTEMS

After a series of experimental sessions involving 12 
groups of four designers, the design process proto- 
cols used by each group were analyzed to evaluate 
whether the ideas put forth by each participant were 
effective or not. At this point it should be noted that 
very little experimental research has been conducted 
on the design process using statistical techniques to 
check for the existence of repetitive patterns linked 
to a particular factor; such studies would enable 
researchers to draw up guidelines or recommenda- 
tions that increase effectiveness. 

The ideas were evaluated in terms of their quantity
and their fulfilment of the initial requirements. If the
idea contradicted the initial requirements, it was considered
invalid; if it responded to the initial requirements
without contradicting them, it was considered
valid; and if it did not address these requirements, it was
cataloged as a non-related idea. Ideas were aggregated
on a more general level (called an alternative). This 
aggregate included those ideas belonging or con- 
tributing to a potential design solution. 
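
The evaluation rule above can be expressed as a short sketch. It is an illustration only: the `Idea` record, the requirement labels, and the example ideas are hypothetical, and the way an idea is judged to "respond to" or "contradict" a requirement is reduced here to simple set membership.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Idea:
    text: str
    alternative: str                               # label of the alternative it contributes to
    addresses: set = field(default_factory=set)    # requirements the idea responds to
    contradicts: set = field(default_factory=set)  # requirements the idea violates

def classify(idea, requirements):
    """Label an idea as invalid, valid, or non-related, in that order of precedence."""
    if idea.contradicts & requirements:
        return "invalid"
    if idea.addresses & requirements:
        return "valid"
    return "non-related"

def group_by_alternative(ideas):
    """Aggregate ideas under the alternative each one belongs to."""
    groups = defaultdict(list)
    for idea in ideas:
        groups[idea.alternative].append(idea.text)
    return dict(groups)

# Hypothetical design problem and ideas, for illustration only.
requirements = {"foldable", "lightweight"}
ideas = [
    Idea("hinged frame", "A", addresses={"foldable"}),
    Idea("steel plate base", "A", contradicts={"lightweight"}),
    Idea("bright red finish", "B"),
]
labels = [classify(idea, requirements) for idea in ideas]
alternatives = group_by_alternative(ideas)
print(labels, alternatives)
```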


The representation of the FBS (function-behavior- 
structure) model of the experiment was also obtained 
to evaluate the progress of the functions and specifi- 
cations of the design, and their relationship with 
the solutions generated. Taking the design protocol 
as our starting point, the different design entities of 
the FBS model can then be identified: the functions 
(F) describing the goals; behavior (B), which 
describes the changes in the state of the structures; 
and finally the structures or solutions (S) [4, 10]. 

To explain how the computer system would be 
able to design on an interactive basis with the 
designer, we made use of Boden’s classification of 
computational models of artificial intelligence for 
creativity into combinational methods and
exploratory-transformational methods [2]. Combi- 
national methods generate novel ideas by making 
unusual combinations between ideas that share some 
similarities. The exploratory-transformational mod- 
els include those based on heuristic searches and 
those based on evolutionary techniques. In the pro- 
tocols of our experiment, creative computing meth- 
ods were identified as being those in which the 
computer system generates ideas in the same fashion 
as designers. 

We also drew on Takeda’s computational
model of synthesis [9], the core of which consists of 
modelling the synthesis as an abduction process, 
according to which solutions stem from the require- 
ments and functions that are selected from the prob- 
lem and from the available knowledge. In turn, the 
analysis consists of using a process of deduction to 
derive new requirements and functions from the 
solutions and the knowledge obtained. 

To illustrate how the design process works, we could 
use the following metaphor. Let’s suppose we have a 
large bag of balls that contain pieces of knowledge 
(we're going to call them ideas). The design problem 
can be broken down into functions and requirements 
for which we have to provide a solution. We would 
take a ball from the bag (which corresponds to the 
exploratory model using Boden’s classification). If it 
fully satisfies our partial problem and does not conflict 
with the rest of the problem, then we would be fin- 
ished (we would say it is an alternative). If the idea is 
of no use to us we would put it back in the bag, but if 
it helps to partially solve our problem, we would keep 
it out of the bag and go on removing balls, combining 
them with one another (combinational), or we would 
bring about changes in them (transformational) until 
we finally reach a valid alternative. 
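
The metaphor can be turned into a toy procedure. The sketch below is a deliberately simplified reading of it, not the authors' system: ideas are hypothetical (name, functions solved) pairs, conflicts between ideas are ignored, the transformational step is left out, and skipping an unhelpful idea stands in for putting the ball back in the bag.

```python
import random

def explore(bag, functions, seed=0):
    """Draw ideas in random order until the kept ones cover every function."""
    order = list(bag)
    random.Random(seed).shuffle(order)   # exploratory step: balls come out at random
    kept, covered = [], set()
    for name, solves in order:
        new = (solves & functions) - covered
        if not new:
            continue                     # of no use to us: back in the bag
        kept.append(name)                # combinational step: join it to the kept ideas
        covered |= new
        if covered == functions:
            return kept                  # every function is solved: an alternative
    return None                          # the bag ran out before the problem was solved

# Hypothetical design problem: three functions and four candidate ideas.
functions = {"hold", "rotate", "lock"}
bag = [
    ("clamp", {"hold"}),
    ("hinge", {"rotate"}),
    ("latch", {"lock", "hold"}),
    ("paint", {"color"}),
]
alternative = explore(bag, functions)
print(alternative)
```

Which ideas end up in the alternative depends on the draw order, which mirrors the point of the metaphor: the same bag can yield different valid alternatives.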

Our research shows that design is more effective 
when the number of alternatives is lower compared to 
the total number of ideas, and when these alternatives
are more interrelated. These circumstances are
favored when designers work with objects. When 
they only use concepts or static images, the creative 
stimulus to form alternatives is much lower. The 
effectiveness obtained in our experiment when con- 
ceptual design involves objects reaches 80%, yet it 
drops to 22% when conceptual design only makes use 
of verbal concepts [11]. 

This difference, which underlines the importance 
of objects in design systems, is produced for several 
reasons. First, there is a greater number of analysis- 
synthesis-evaluation cycles being carried out itera- 
tively as new ideas are proposed; second, the partial 
solutions combine with one another from the outset, 
when solution principles are still being obtained with- 
out waiting for all the partial solutions to be pro- 
duced; and third, the design solutions are more 
complete, that is, they provide an answer to a greater 
number of design functions. 

Our research suggests computer systems should 
not only allow objects to be visualized and manipu- 
lated, but also offer the designer new alternatives for 
the different functions by search methods later com- 
pleted by transformation procedures. These alterna- 
tives would be interrelated by combination 
procedures, and also with the proposals put forward
by the designer.

Findings from our study suggest the analysis-syn- 
thesis-evaluation processes can be guided by selecting 
the right methods for obtaining ideas. Some of the 
tasks that could be carried out by the design support 
computer system in order to achieve a more effective 
design would include (see Figure 2): 


* Focusing the initial computer search for ideas on 
just a set of functions chosen by the designer as 
key functions the design must fulfil. It is impor- 
tant to avoid incorporating large numbers of 
additional functions at the beginning. 

* Combining partial solutions from the outset and 
throughout the entire design process by means of 
transformational methods. 

* Applying a higher proportion of transformational
procedures than other types of procedure 
(exploratory and combinational). 


Likewise, the computer system could perform sev- 
eral different tasks to guide the actions carried out 
by the designer, which would be achieved by: 


* Encouraging the designer to solve the initial
functions and offering the partial solutions that have
already been obtained so the designer can select
the most suitable and combine them with one
another from the outset.

* Fostering designer associations by displaying on- 
screen the solutions that have already been 
obtained. 

* Keeping the designer working with a small num- 
ber of alternatives and alerting the designer when 
it might be wise to work with another alternative 
or to seek a new one, after generating a reason- 
able number of ideas within the same alternative. 


When the objective is originality, rather than effectiveness,
a higher number of exploratory procedures
must be applied and the designer will have to introduce
a greater number of additional functions in
order to increase the probability of generating novel
ideas.


By HARVEY G. ENNS, THOMAS W. FERRATT, and JAYESH PRASAD


BEYOND STEREOTYPES OF IT PROFESSIONALS:
IMPLICATIONS FOR IT HR PRACTICES


IT professionals are complicated; managers need to go
beyond stereotypes to truly understand them.


What is your image of an IT professional? 
A popular image is the “technology geek” 
[6]. Such caricatures of the IT profes- 
sional can create problems for the profes- 
sion by reducing its attractiveness to 
prospective entrants. Likewise, stereotypi- 
cal images held by managers can create 
potentially serious problems for IT profes- 
sionals and their employers since IT 
human resource (IT HR) practices are 
based on managers’ views. If IT HR prac- 
tices are based on invalid images of IT 
professionals, HR practices will be inef- 
fective, resulting in such negative conse- 
quences as higher-than-expected turnover 
and decreased performance. Conse- 
quently, it is important to compare stereo- 
types with the actual characteristics of IT 
professionals to ensure an organization's 
HR practices are based on a valid descrip- 
tion of those professionals. 

Toward this end, we identify three
stereotypes pervasive in the IT manage-
ment literature and use survey data from
180 members of a national organization of 
IT professionals to explore whether the 
stereotypes are valid descriptions of them. 
Survey participants reported holding the 
following mainline IT positions: IT profes- 
sional with no supervisory responsibilities 
(36%), IT supervisor (10%), middle-level 
IT manager (30%), or senior IT manager 
(24%). They were predominantly men 
(72%) and white (88%). On average, they 
had been with their current employer for 
nine years and were 48 years old. The ques- 
tions we asked to examine the validity of 
the stereotypes are reported in the discus- 
sion of the stereotypes described here. 

To preview our findings, we demon- 
strate that the stereotypes describe some, 
but not most, IT professionals. As a result, 
IT managers must go beyond the stereo- 
types to more fully understand the IT pro- 
fessionals they manage. This better 
understanding should help managers establish
more appropriate HR practices than if they
assumed the majority of IT professionals fit the 
stereotypes. 


STEREOTYPE #1: THE “HIGH MAINTENANCE” IT 
PROFESSIONAL 

According to this stereotype, IT professionals pro- 
vide significant value to their organizations, but they 
expect their organizations to meet their many needs. 
Expectations of these high achievers include more 
pay, benefits, interesting work, recognition, and 
opportunities for growth and development. These 
IT professionals define more interesting work as the 
opportunity to work on hot projects with the latest 
and greatest information technology. They spend 
long hours, preferably at their time and place of 
choice, learning new technologies and determining 
how to make systems work. Nonetheless, they want 
the organization to provide them with training and 
new technologies to help them keep current. They 
want to be challenged to make great things happen 
with these technologies and to be appreciated for 
their contributions. Fundamentally, the character of 
this stereotype is consistent with research showing 
that high performers are motivated by specific chal- 
lenging goals and feedback [3, 5] and that IT pro- 
fessionals have a high need for growth and 
development [2]. 

An organization wishing to motivate and retain IT 
professionals would need to provide the many incen- 
tives described in this stereotype, making “High 
Maintenance” an apt description of these high achiev- 
ing employees. IT managers fear these employees will 
go elsewhere if the organization does not give them 
these incentives. These stereotypical IT professionals 
are less reliant on their organizations to provide secure 
employment since they rely on their technical skills to 
find alternative employment if they do not get what 
they want. This is particularly evident when market 
conditions are favorable for IT professionals. 

To determine how closely the “High Maintenance”
stereotype matches reality, we examined whether the
incentives noted in the description of such IT profes- 
sionals are important to our sample of IT profession- 
als. Motive profiles were developed based on 
participants’ ratings of the importance of items 
related to security, achievement, and flexibility needs. 
Response scales ranged from 1 (low importance) to 5 
(high importance). The items for security needs were 
job/income security and level of benefits. The items 
for achievement needs were career development 
opportunities, recognition, and specific performance 
requirements. Finally, the items for flexibility needs 
involved freedom to choose when and where to work. 
Importance scores for security, achievement, and flex- 
ibility needs were derived by averaging responses on 
related items. The scores for these three needs were 
then used in a cluster analysis, which combined indi- 
viduals with similar scores into groups. The cluster 
analysis resulted in three groups of IT professionals 
with the distinct motive profiles shown in Figure 1. 
Only one of these three motive profiles matched 
Stereotype #1. 
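
The scoring and grouping steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not say which clustering algorithm was used, so a plain k-means is substituted here, and the item names, the six respondents, and the choice of the first three respondents as starting centroids are all hypothetical.

```python
from statistics import mean

# Hypothetical item names paraphrasing the survey items; ratings run from 1 to 5.
NEED_ITEMS = {
    "security": ["job_income_security", "benefits"],
    "achievement": ["career_development", "recognition", "performance_requirements"],
    "flexibility": ["choose_when", "choose_where"],
}

def need_scores(ratings):
    """Average the item ratings for each need: (security, achievement, flexibility)."""
    return tuple(mean(ratings[item] for item in items) for items in NEED_ITEMS.values())

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))

def kmeans(points, centroids, iterations=20):
    """Plain k-means (an assumed stand-in for the article's unspecified cluster analysis)."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its members; keep it if the group is empty.
        centroids = [
            tuple(mean(dim) for dim in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return [nearest(p, centroids) for p in points], centroids

def respondent(sec, ach, flex):
    """Build a ratings dict from per-need item lists (illustrative data only)."""
    values = sec + ach + flex
    names = [item for items in NEED_ITEMS.values() for item in items]
    return dict(zip(names, values))

respondents = [
    respondent([2, 2], [5, 5, 4], [4, 4]),  # low security, high achievement and flexibility
    respondent([4, 5], [2, 2, 2], [5, 5]),  # low achievement, high flexibility
    respondent([5, 4], [5, 4, 5], [2, 1]),  # high security and achievement, low flexibility
    respondent([1, 2], [4, 5, 5], [5, 4]),
    respondent([5, 4], [1, 2, 3], [4, 5]),
    respondent([4, 4], [4, 5, 4], [1, 2]),
]
points = [need_scores(r) for r in respondents]
labels, centroids = kmeans(points, points[:3])
```

With these made-up ratings, respondents with similar need scores end up sharing a label, which is the sense in which the cluster analysis "combined individuals with similar scores into groups."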

Sixty-five IT professionals (or 36% of our sample) 
were placed into a group we labeled “High Mainte- 
nance.” Their lower security needs, compared to the 
other two groups, the “Lifestyle” and “Committed” 
groups, suggest they are confident of finding alterna- 
tive employment if their current arrangement does 
not match their expectations. In contrast, their higher 
achievement needs (relative to the Lifestyle group) 
and higher flexibility needs (compared to the Com- 
mitted group) indicate they want the many perks 
consistent with the stereotype. Thus, this group pro- 
vides support for Stereotype #1. 

Forty-five IT professionals were placed into the 
Lifestyle group. Contrary to the stereotype, this 
group, consisting of 25% of the IT professionals, was 
actually less achievement oriented. Relative to the 
other groups, it is not as important for the Lifestyle 
group to receive recognition from their organization. 
It is also less important that the organization provide
career development opportunities or specific performance
requirements. IT professionals in this group
may have a more independent lifestyle and, thus, may
not be as interested in their organizations providing
these achievement-oriented incentives. They
may be intrinsically motivated, more self-confident,
and thus more self-sufficient than others.

To illustrate, consider 
how members of this 
group may address career 
development. They may 
rely on self-directed learning
obtained via the Web
and discussions with their 
peers; thus, they are less 
reliant on their organiza- 
tions to provide career 
development opportuni- 
ties. Alternatively, the 
lower importance of career development opportuni- 
ties may mean that family commitments or other pri- 
orities, perhaps resulting from their career stage, lead 
these IT professionals to seek a more balanced lifestyle 
between work and non-work (including family or 
community) activities. This is consistent with the 
Lifestyle group’s need for the greatest amount of flex- 
ibility compared to the other two groups (see Figure 
1). As one IT professional noted, “I have three chil- 
dren and I’m married. I still want to have a life” [4]. 
The bottom line is that the Lifestyle group is less 
reliant on the organization to meet achievement needs 
than the other groups. 

Seventy IT professionals (or 39% of our sample) 
were placed into the Committed group. These IT 
professionals willingly work when and where the orga- 
nization wants, since such flexibility is relatively 
unimportant to them compared to the High Mainte- 
nance and Lifestyle groups. 

Taken together, this evidence suggests that the 
High Maintenance stereotype does not always hold. 
The higher security needs of both the Committed 
group and the Lifestyle group stand in contrast to the 
lower security needs of the IT professionals in Stereo- 
type #1. With 64% (115/180) of IT professionals in 
these latter two motive profiles, only about one-third 
of those in our sample conform to the High Mainte- 
nance stereotype. 
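
As a quick check of the arithmetic, the group sizes reported above can be turned into shares of the 180-person sample. The variable names are illustrative; the counts are the ones given in the text.

```python
# Group sizes as reported in the article (sample of 180 IT professionals).
group_sizes = {"High Maintenance": 65, "Lifestyle": 45, "Committed": 70}

total = sum(group_sizes.values())
# Share of the sample in each motive profile, rounded to whole percentages.
shares = {name: round(100 * n / total) for name, n in group_sizes.items()}

# The two profiles that do not match Stereotype #1.
non_high_maintenance = group_sizes["Lifestyle"] + group_sizes["Committed"]
non_high_maintenance_share = round(100 * non_high_maintenance / total)
```

The rounded shares reproduce the 36%, 25%, and 39% quoted for the three groups and the 64% (115/180) quoted for the Lifestyle and Committed groups combined.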

The challenge for managers is to recognize that a
large percentage of their IT professionals are probably
not in the High Maintenance group and, thus, managers
must make trade-offs about the types of incentives
provided to their IT staff. For example, having
resources devoted to employment guarantees is more
important for the Lifestyle and Committed groups
than the High Maintenance group. The High Maintenance
and Lifestyle groups might need extra dollars allocated for
telecommuting equipment, in line with their higher
flexibility needs, while the Committed group needs less of those
resources. Similarly, for IT professionals in the
Lifestyle group, managers could reduce the number
of formal opportunities or development dollars
allocated to learn new technologies while increasing them for the High
Maintenance and Committed groups. To have HR
practices based mainly on Stereotype #1 will clearly
result in misaligned or wasted resources.

Figure 1. Motive profiles by need categories.


STEREOTYPE #2: THE “OLDER, STATIC” IT PROFESSIONAL

According to this stereotype, older IT professionals 
are radically different from younger IT professionals 
since they have a lower preference for achievement, 
specifically career development opportunities [1, 8]. 
There is a belief that IT professionals who have 
learned a skill and maintained that skill for years 
cannot adapt or learn a new one [9]. Consequently, 
organizations may devote less time and fewer 
resources toward updating the IT skills of this older 
group. Taking all of this into consideration, the mar- 
ketability and job mobility of older IT professionals 
may be impaired. 

To explore how well this stereotype matches reality, 
we first investigated whether IT professionals follow 
what is often considered a normal career progression, 
with younger employees having short job tenure (pre- 
sumably the youngest, most energetic IT professional) 
and older employees having long tenure (presumably 
the oldest, most static IT professional). Cluster analy- 
sis, similar to that described in Stereotype #1, was 
used to identify career stages. With age and number of 
years in the current organization as the clustering vari- 
ables, four distinct career stages were identified in our 
sample, as shown in Figure 2. These career stages, 
used in examining Stereotype #2, raise questions 
about this stereotype. 

For instance, older IT professionals are not all alike
with respect to mobility. As shown in Figure 2, both
the “Middle-Aged” and “Older-Movers” in our sample
do not fit the stereotype that job tenure
increases with age. The Middle-Aged group does
not seem to differ from the younger group, and
the Older-Movers are radically more mobile than
their slightly younger counterparts in the
Older-Stayers group. This evidence is counter
to the stereotype that older IT professionals
have limited mobility.

Figure 2. Career stages: Average age and years in organization (four career stages: N=22, N=54, N=45, N=58).

COMMUNICATIONS OF THE ACM April 2006/Vol. 49, No. 4 107

When we analyzed the groups by motive profiles
across career stages, we found further evidence
that Stereotype #2 is not a valid description of older
IT professionals. For example, consider those in the
Lifestyle group. IT professionals in this group look
less for career development opportunities from their
organizations. If the stereotype were true, the Lifestyle
group would dominate the older career stages and be
relatively rare in the younger career stages. Not only
did this group not dominate the older IT professionals
(having only 25% in the older career stages), it also
made up a similar percentage (24%) of the two younger
career stages (see Figure 3).

For the High Maintenance and Committed
IT professionals, we found that Stereotype #2
also did not hold. Within these two higher
achievement oriented groups, the Older-Stayers had
higher preferences for job/income security, level of
benefits, career development opportunities, and
recognition than the Older-Movers. Perhaps the reason
these preferences are higher among Older-Stayers
is that they have not had the opportunities the Older-
Movers have had to satisfy their needs through job
moves. These results suggest that not all older IT workers
are alike.

Clearly, age alone is not a good basis for designing
IT HR practices. The results suggest that HR practices
should be based on a more complex combination
of age, tenure, and motive patterns. Counter to
the stereotype, companies should continue to provide
opportunities for those older IT professionals
who are willing to change, repackage themselves,
and acquire new skills to be successful.
These employees may have valuable knowledge
of a firm and its applications into which they can [...].
Conceivably, this would make them even more valuable
than a younger IT professional who does not
have such firm-specific knowledge and skills.


STEREOTYPE #3: THE “TECHNOLOGY ANCHORED”
IT PROFESSIONAL

Consistent with the “technology geek” image, IT
professionals define themselves and their value in
terms of the number and difficulty of the technical
skills they have mastered [8]. Their careers are
anchored in technology. Since the half-life of technologies
is short, anyone who does not keep up is a
dinosaur. Like the dinosaur, these out-of-date
individuals are in danger of becoming
extinct, forever severed from the evolving
IT career world. According to this third stereotype,
moving into a managerial position limits
career options since technical competencies
erode quickly in such a non-technical position.
Therefore, this stereotype views IT professionals as
not valuing non-technical/managerial competencies
and assumes management positions are career dead-ends,
thus limiting the pool of potential IT managers. In the words of
one IT professional who made the transition to
manager, “Most of the young-gun technology talent
out there say they tolerate the ‘suits’ but have no
aspirations to become one” [7].

Figure 3. Combined career stages by motive profiles.

To explore Stereotype #3, we used data based on
participants’ agreement with three statements about
the value of their competencies. For example, participants
rated the following statement on a 5-point scale
(where 1 meant “strongly disagree” and 5 meant
“strongly agree”):

In general, compared with others whom employers
would consider for jobs that would interest me, my
knowledge, skills, and abilities are relatively attractive.

We obtained an average score across these compe- 
tency items for each IT professional. We used that 
score to compare those in managerial (non-techni- 
cal) positions with those in non-managerial (techni- 
cal) positions. In contrast with the stereotype, 
managers indicated their competencies were more 
valuable to them (with a rating of 4.3) than did 
those whose jobs were more technically focused 
(with a rating of 3.9). IT professionals with manage- 
ment responsibilities appear to define themselves in 
terms of their management experiences [8] and not 
their technical competencies. 
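
The comparison just described, averaging the competency items per respondent and then comparing group means, might look like the following sketch. The four responses are invented for illustration and are not the survey data, so the resulting means differ from the 4.3 and 3.9 reported above.

```python
from statistics import mean

# Hypothetical responses: position type plus 1-5 agreement ratings
# on the three competency statements.
responses = [
    ("managerial", [5, 4, 4]),
    ("managerial", [4, 5, 4]),
    ("technical", [4, 4, 3]),
    ("technical", [4, 4, 4]),
]

def competency_score(ratings):
    """Average score across the competency items for one respondent."""
    return mean(ratings)

def group_means(rows):
    """Mean competency score for each position type, rounded to two decimals."""
    by_group = {}
    for position, ratings in rows:
        by_group.setdefault(position, []).append(competency_score(ratings))
    return {position: round(mean(scores), 2) for position, scores in by_group.items()}

means = group_means(responses)
```

A higher mean for the managerial group in such a comparison is the pattern the article reports, which runs against Stereotype #3.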

The major implication from this analysis is that 
HR practices should focus on developing not just 
technical skills but also an awareness of the possibili- 
ties of a wide range of career paths. IT professionals 
who have not moved into managerial ranks might 
benefit from understanding the value of the compe- 
tencies they could develop as managers. Those being 
groomed for managerial positions should not only be 
given managerial skills training but should also be 
provided with a broader perspective on the motiva- 
tions and value of being a manager, through career 
planning exercises or access to managerial-level men- 
tors. The organization will then benefit from a greater 
pool of internal candidates to fill open managerial 
positions. 


CONCLUSION 

The research highlighted in this article provides evi- 
dence that IT professionals are more complicated 
than the stereotypes we examined indicate. Contrary 
to these stereotypes, IT professionals possess a diver- 
sity of motivations that cut across age and organiza- 
tional tenure profiles. Rather than rely on simple 
generalizations, managers must evaluate an intricate 
combination of motives (such as achievement, secu- 
rity, and flexibility) in conjunction with career stages 
to predict the needs of individual IT professionals. 
Paying attention to such complexity, however,
should have its rewards if organizations wish to insti-
tute IT HR practices that align with their IT profes- 
sionals’ needs. Specifically, the allocation of scarce 
organizational resources should be more efficient 
and effective by targeting different resources toward 
supporting IT professionals with dissimilar needs 
rather than directing all types of resources indiscrim- 
inately at all IT professionals. Consequently, for the 
benefit of IT professionals and their employers, 
managers should look beyond the stereotypes and 
strive for a richer understanding of their IT professionals.


IT Managers’ Requisite Skills

Matching job seekers’ qualifications with employers’ skill
requirements.

The skills possessed by IT
managers in an organiza- 
tion reflect the degree to 
which the organization 
can transform its IT investment 
into competitive advantage and 
new strategic opportunities. 
However, some organizations 
have complained that their IT 
managers do not possess the 
skills required for such opportu- 
nities. Furthermore, IT skills 
tend to become obsolete
faster than ever before [1]. To
understand the most up-to-date 
skill requirements for contempo- 
rary IT managers, the study 
reported here collected and ana- 
lyzed 555 job advertisements 
posted on the corporate Web 
sites of Fortune 500 companies. 
It is overwhelmingly recognized 
that the Fortune 500 companies 
are highly influential on the 
demand and supply of IT human 
resources in the U.S., which is 
why we chose this group for our 
study: empirical research concern- 
ing the skills important to IT 
managers in large corporations. In 
planning their future careers, 
applicants for IT manager posi-
tions in Fortune 500 companies
need specific information about 
the skills required for employ- 
ment, advancement, and profes- 
sional accomplishment. While 
previous studies have examined 
the IT skill issue, this is one of the 
first that analyzed job ads from the 
Fortune 500. It is imperative for 
IT researchers and practitioners to 
investigate the skill requirements 
of these large corporations. 


DATA COLLECTION

iLogos Research [3] reports that 
81% of Fortune 500 companies 
announce their job openings on 
their corporate careers Web sites. 
The number of jobs on all of the 
Fortune 500 corporate Web sites 
is estimated to be three times 
greater than the number on any 
other online job boards, such as 
Monster and Hotjobs. Data was 
collected from Fortune 500 cor- 
porate Web sites every quarter 
from March 2001 through Feb- 
ruary 2003. For each quarter, no 
more than five job ads for IT 
managers were collected from 
each Web site. Part-time jobs and 
positions that seemed to be tem-
porary or contract-based, such as
IT project manager and IT test- 
ing manager, were excluded. A 
total of 555 job ads were col- 
lected: 182 in 2001, 299 in 
2002, and 74 in 2003. At least 
one job ad was provided by each 
of 201 different companies from 
42 states and Washington, D.C., 
during the two-year period. 
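
The sampling rule described above (at most five IT manager ads per site per quarter, skipping part-time, temporary, and contract-style positions) can be sketched as follows. The exclusion terms, the tuple layout, and the company name "Acme" are assumptions made for illustration; the article describes the rule only in prose.

```python
from collections import defaultdict

MAX_ADS_PER_SITE_PER_QUARTER = 5
# Assumed exclusion terms, paraphrasing the article's description.
EXCLUDED_TERMS = ("part-time", "temporary", "contract", "project manager", "testing manager")

def select_ads(postings):
    """Keep at most five eligible ads per (company, quarter), in the order encountered.

    postings: iterable of (company, quarter, title) tuples.
    """
    kept = []
    taken = defaultdict(int)
    for company, quarter, title in postings:
        if any(term in title.lower() for term in EXCLUDED_TERMS):
            continue  # skip part-time, temporary, or contract-style positions
        if taken[(company, quarter)] >= MAX_ADS_PER_SITE_PER_QUARTER:
            continue  # this site's quota for the quarter is already full
        taken[(company, quarter)] += 1
        kept.append((company, quarter, title))
    return kept

# Hypothetical postings from one site across two quarters.
postings = [("Acme", "2001Q2", f"IT Manager {i}") for i in range(6)]
postings.append(("Acme", "2001Q2", "IT Project Manager"))
postings.append(("Acme", "2001Q3", "IT Manager, Infrastructure"))
selected = select_ads(postings)
```

Here the sixth eligible ad from the same quarter and the project manager posting are dropped, while the ad from the following quarter is kept.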


RESULTS 

The basic analysis was performed 
to ascertain qualification require- 
ments such as education, profes- 
sional certification, and ability to 
travel. From the applicant’s point 
of view, the basic requirements 
are explicit and necessary condi- 
tions to apply for the job. From 
the employer's perspective, these 
requirements are the gates that 
prevent unqualified applicants 
from applying for the job. 

While about 75% (418) of the job ads collected for IT managers
specified the education requirements their potential candidates
should meet, the remainder (137 ads) did not. The majority
of the ads were looking for people who held at least a
bachelor’s degree (73.2%), as shown in Table 1. Only a small
percentage of ads were looking for high school graduates (1.4%)
or associate degree (0.7%) holders. This might be due to the
fact the job ads collected were only from the Fortune 500.
However, the message is clear: IT job applicants with high
school diplomas or associate’s degrees can hardly expect to
become IT managers in Fortune 500 companies. The master’s
degree, however, seems valuable for those seeking an IT manager
job. Of the 555 job ads for IT managers, 132 ads (23.8%)
mentioned master’s degree holders as required or preferred.

Table 1. Education requirements.

Education               Ads      %
High School               8    1.4
Associate                 4    0.7
Bachelor’s or Higher    406   73.2
Omission                137   24.7
Total                   555  100.0

It is well known that certifications make a positive impact on
employers. The results indicate, however, that only 43 ads (7.7%)
for IT managers mentioned certification. Companies listed in the
Fortune 500 do not seem to appreciate the value of certifications
for their IT managers. Moreover, these 43 ads emphasized that a
professional certification was not a requirement but a plus. The
limited advantage appears to come mainly from project management
certification rather than from software- or hardware-related
accreditation.

Of the 100 job ads (18%) that mentioned travel, 31 ads did not
specify the percentage of working hours required for travel. These
31 ads described their travel requirement by mentioning phrases
such as “some travel,” “willing to travel,” or “travel required.”
IT managers in the Fortune 500 are generally expected to travel
both domestically and internationally. Note that the percentage of
travel requirement varies from 0 to 90. Twenty-eight percent of the
100 job ads that mentioned “travel” were looking for road warrior
IT managers, who are expected to travel more than 25% of their
working hours. IT professionals change their employers very often;
in fact, the average job tenure in IT is reportedly about 13 months.
The high level of travel required for IT managers might be a
possible contributor to the high turnover rate.


IN-DEPTH ANALYSIS

The list of skill requirements presented in Table 2 was developed
based on the classification scheme proposed by Todd and his
colleagues [4]. However, minor changes, such as the Internet and
e-business, were added to reflect the time gap between the mid-1990s
and the early 2000s. The figure here shows the overall trend by
counting the number of ads that referred to each category at least
once in the ad. At least 60% of ads mentioned phrases related to all
the skills, except hardware. IT managers in the Fortune 500 are
required to have technical and system skills as well as business
skills. This result is somewhat surprising because most previous
studies have considered the role of IT managers to be managerial and
business oriented rather than technically oriented.

Table 2. Number of ads for skills requirement.

Skill                                          Ads      %

Software                                       451   81.3
  Packages                                     285   51.4
  General knowledge of S/W                     234   42.2
  Database                                     173   31.2
  Operating systems/platforms                   86   15.5
  Programming language                          78   14.1
  CASE                                          10    1.8

Architecture/Network                           338   60.9
  General knowledge of architecture/network    158   28.5
  Internet                                     152   27.4
  Networking and devices                       136   24.5
  Client/Server                                 68   12.3
  Mainframe                                     60   10.8
  Security                                      55    9.9
  LAN/WAN                                       40    7.2

Hardware                                       220   39.6
  General knowledge of H/W                     131   23.6
  Server                                        98   17.7
  Desktop/PC                                    75   13.5
  Devices/Printers/Storage                      41    7.4

Management                                     547   98.6
  General knowledge of management              486   87.6
  Organization                                 425   76.6
  Leadership                                   361   65.0
  Project management                           348   62.7
  Planning                                     338   60.9
  Monitor and control                          308   55.5
  Training                                     145   26.1

Social                                         513   92.4
  Interpersonal skill                          458   82.5
  Communication skill                          426   76.8
  Independent/Self-motivated                   126   22.7

Business                                       496   89.4
  General knowledge of business                488   87.9
  Function specific                            333   60.0
  Enterprisewide                               284   51.2
  Industry specific                            130   23.4
  Electronic business                           54    9.7

Development                                    546   98.4
  General knowledge of development             368   66.3
  Implementation                               366   65.9
  Analysis                                     343   61.8
  Knowledge of technological trends            339   61.1
  Operations/maintenance                       339   61.1
  Design                                       235   42.3
  Methodologies                                157   28.3
  Integration                                  146   26.3
  Quality assurance                            145   26.1
  Documentation                                124   22.3
  Programming                                   72   13.0

Problem solving                                498   89.7
  Quantitative                                 274   49.4
  Adaptive/flexible                            268   48.3
  Customer-oriented                            268   48.3
  Technical expertise                          220   39.6
  General problem solving                      165   29.7
  Analytical/logical                           107   19.3
  Modeling                                      50    9.0
  Creative/innovative                           39    7.0

Figure. Percentage of advertisements that refer to each category at least once.

Technical skills. Under the class of technical skills, there are
three skill categories: software, architecture/network, and
hardware. The results indicate IT managers in the Fortune 500
perform their work on software (81.3%) and architecture/network
(60.9%) more than on hardware (39.6%). It is noteworthy that 338 job
ads (60.9%) mentioned either phrases or words related to
architecture/network at least once in the ad. It is clear that IT
managers in the Fortune 500 work less on hardware such as PCs,
peripheral devices, and servers, but more on architecture- and
network-related systems, such as computer networking, client/server,
mainframe, and security.

Of the six skills under the software category, packages (51.4%),
general knowledge of software (42.2%), and database (31.2%) received
the most attention. Packages ranged from Microsoft Office to SPSS.
Oracle and SQL-Server were referred to by 61 (11.0%) and 13 (2.3%)
job ads, respectively. Phrases and words related to data warehousing
were found in 24 ads (4.3%) for IT managers. None of the individual
skills in the architecture/network and hardware categories received
much attention from the Fortune 500. Fewer than 30% of job ads
mentioned any one skill under these two categories.

The Business skills class includes three categories: management,
social, and business. The number of ads that mentioned each of these
three skills at least once in the ad was 547 (98.6%) for management,
513 (92.4%) for social skill, and 496 (89.4%) for business.
Specifically, most attention was given to interpersonal skill
(82.5%) as well as the general knowledge of business (87.9%) and
management (87.6%). This result clearly indicates that Fortune 500
companies expect their IT managers to play managerial and behavioral
roles.

IT managers must be people-oriented. Both interpersonal (82.5%) and
communication skills (76.8%) were required by more than
three-quarters of job ads, whereas applicants described as
independent/self-motivated (22.7%) were sought in fewer than
one-quarter of ads. Communication skills that appeared in the ads
encompassed not only verbal skills, such as presentation and
persuasion, but also non-verbal skills, such as writing and body
language.

The main function of an IT department is to support other business
functions such as accounting, finance, marketing, and production.
For this reason, it is not surprising that functional knowledge was
mentioned in 60% of the ads. It is surprising, however, that more
than half the job ads for IT managers mentioned enterprisewide
knowledge (51.2%). IT managers are required to understand not only
business functions but also the overall business processes of the
enterprise that connect those independent business functions. This
result seems to be somewhat related to the adoption of enterprise
resource planning (ERP) systems such as SAP and Peoplesoft. ERP
software has saturated the market, especially the Fortune 500
market. Electronic business, a newly evolving concept, drew little
attention, as only 9.7% of job ads mentioned “electronic business”
or “electronic commerce.”
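
The "at least once" counting behind the figure and Table 2 can be illustrated with a short sketch. The keyword lists and the four sample ads are hypothetical; the study's actual coding scheme, adapted from Todd and his colleagues [4], is not reproduced here, and simple substring matching is an assumed stand-in for however the ads were actually coded.

```python
# Hypothetical keyword lists per category (illustration only).
CATEGORY_KEYWORDS = {
    "software": ["oracle", "sql", "microsoft office"],
    "hardware": ["server", "desktop", "printer"],
    "social": ["interpersonal", "communication"],
}

def categories_in(ad_text):
    """Return the set of categories with at least one keyword present in the ad."""
    text = ad_text.lower()
    return {
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    }

def category_percentages(ads):
    """Percentage of ads that refer to each category at least once."""
    counts = {category: 0 for category in CATEGORY_KEYWORDS}
    for ad in ads:
        # An ad counts once per category, however many of its keywords match.
        for category in categories_in(ad):
            counts[category] += 1
    return {category: round(100 * n / len(ads), 1) for category, n in counts.items()}

sample_ads = [
    "Strong interpersonal and communication skills; Oracle and SQL experience.",
    "Manage server and desktop support teams.",
    "Excellent communication skills required.",
    "Experience with Microsoft Office.",
]
percentages = category_percentages(sample_ads)
```

Because each ad contributes at most one count per category, the percentages across categories can sum to well over 100%, as they do in the article's figure.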
The System skills class includes
two categories: development and 
problem solving. The figure indi- 
cates that phrases or words 
related to development and prob- 
lem solving appear at least once 
in 546 (98.4%) and 498 (89.7%) 
advertisements, respectively. 
However, no specific skill under 
these two categories turned out 
to be very critical for IT man- 
agers. This result can be inter- 
preted to mean that even though 
Fortune 500 companies do not 
typically require IT managers to 
possess specific skills for problem 
solving and development, they 
want their IT managers to have 
an overall understanding of the 
skills in these two categories. It 
should be noted, however, that 
such development skills as gen- 
eral knowledge of development, 
implementation, analysis, knowl- 
edge of technological trends, and 
operation/maintenance were 
mentioned in more than 60% of 
the ads, whereas no skill under 
the category of problem solving 
was mentioned in more than 
50% of the ads surveyed.

Under the development cate- 
gory, the five skills that were 
mentioned in more than 60% of 
job ads were general knowledge 
of development (66.3%), imple- 
mentation (65.9%), analysis 
(61.8%), knowledge of techno- 
logical trends (61.1%), and oper- 
ation/maintenance (61.1%). 
Although the percentage is
not as high as for these five
skills, design was required in
42.3% of ads, high enough for IT
managers to pay attention to it.
For such purposes as budgeting, 
technological directing, and leading
technical projects, IT managers
need to be very sensitive to
technological trends. Such sensi- 
tivity is responsible for the 
employers’ interest in applicants 
who are adaptive/flexible 
(48.3%). It is very true that “the 
way to mastery is the way of a 
permanent beginner [2].” 
Almost 50% of job ads men- 
tioned three skills under the cate- 
gory of problem solving: 
quantitative (49.4%), 
adaptive/flexible (48.3%), and 
customer-oriented (48.3%). 
Quantitative skills include bud- 
geting, management science, 
mathematics, and statistics. The 
customer-oriented mindset is 
another requirement for IT man- 
agers who are otherwise very sus- 
ceptible to changes driven by 
new business environments as 
well as by new technologies. This 
result is also consistent with 
recent trends indicating an IT 
department should no longer be 
a cost center but rather a service 
center for organizational cus- 
tomers. Almost four out of 10 
ads mentioned technical expertise 
(39.6%). Although IT managers 
perform managerial tasks more 
than technical ones, it still seems 
important for them to have tech- 
nical expertise. For example, IT 
managers need to assign technical 
jobs to their subordinates and, 
furthermore, must understand 
technical requests from users as 
well as other IT workers such as
programmers, systems analysts,
and network administrators. 


CONCLUSION 

This study represents one of the 
first attempts to investigate IT 
managers’ skill requirements by 
analyzing job ads posted by Fortune 500
companies. From the analysis of the
results, we conclude that IT
managers should possess both
behavioral and technical skills; it
is not either/or but both. By collecting
data continuously, this
study will attempt to contribute 
to a longitudinal study to delin- 
eate changes in IT skill require- 
ments over time through 
turbulent changes in the business 
environment and technology.


Concordia University Wisconsin
Computer Science Faculty Position 
Candidates must possess strong teaching skills 
and the ability to teach a range of computer 
science courses. The Computer Science program
at CUW has a strong liberal arts emphasis
and a solid “CSO” foundation. Check
http://www.cuw.edu/News_Facts/human_resources/index.html
for more information and
to download an application. Send a letter of 
application, curriculum vita, and completed 
application to: Director of Human Resources, 
Dept. CSC, Concordia University Wisconsin, 
12800 N. Lake Shore Dr. Mequon, WI 53097. 


DreamWorks Animation, SKG 
Pipeline Engineer 
DreamWorks Animation, SKG seeks
PIPELINE ENGINEERS. Location: Glendale,
California. Send resumes to: recruiting@pdi.com.
For a detailed job description, visit
us at: www.dreamworksanimation.com. Click
on "Jobs".


GE Global Research 
Information Systems Engineer 

GE Global Research is engaged in R&D ef- 
forts in the area of Enterprise System and Ar- 
chitectural solutions as well as underlying Ar- 
tificial Intelligence techniques to support 
automated decision making for a wide variety 
of GE businesses. Responsibilities will include 
assessing current and future technology capa- 
bilities and needs in the area of Services In- 
formation Technology for the GE businesses, 
and developing next-generation system and 
architectural solutions for services and deci- 
sion support applications. For more informa- 
tion, and to apply to this position, please visit 
www.gecareers.com and enter 446083 in the 
job # field. Equal Opportunity Employer 


Helsinki Institute for
Information Technology
Computer Networks and Security 
Helsinki Institute for Information Technology 
A post-doc position for research and teaching 
is opened for a period 1.8.2006-31.12.2008 in 
the area of computer networks and security at 
HIIT, Finland. The position is at the Trust- 
worthy Internet excellence group focusing on 
trust overlays and Internet architecture in col- 
laboration with Prof. Scott Shenker at ICSI, 
Berkeley. The applicants should have a PhD degree
in CS or EE and a track record of ACM/IEEE publications
in the area. Full information is available
at http://www.hiit.fi/jobs/index.html.


Inovec, Inc 
Software Development Manager 

Manage new product development team and 
lead development of new optimization algo- 
rithms and products applicable to sawmill in- 
dustry. Management and extensive engineer- 
ing experience in sawmill optimization and 
scanning systems required. 


Montana Tech of The 
University of Montana 
Instructor, Assistant or Associate 
Professor 
Montana Tech is seeking qualified applicants 
for tenure-track positions at the Instructor, 
Assistant or possibly Associate Professor level 
with expertise in software engineering, com- 
puter science, and/or information sciences/sys- 
tems starting in August 2006. An earned 
Ph.D. in Software Engineering, Computer Science,
Information Sciences/Systems, and/or a
closely related field with significant experi- 
ence in the aforementioned areas is preferred, 
but an ABD or M.S. will be considered. Preference
will be given to candidates having (1)
prior university level teaching experience and
(2) evidence of ongoing scholarly activity.
Salary is commensurate with qualifications
and experience. For more information visit
www.mtech.edu/employment. Send a letter of 
application, brief statement of professional 
goals and current scholarly activity, statement 
of teaching philosophy, curriculum vita, grad- 
uate transcripts, and the names of three pro- 
fessional references to: Cathy Isakson, Mon- 
tana Tech Personnel Office, 1300 West Park 
Street, Butte, MT 59701 or e-mail to
cisakson@mtech.edu


Oxford Asset Management 
Senior Software Engineer 
We are looking for extremely strong develop- 
ers and researchers who can routinely complete 
tasks in a fraction of the time most program- 
mers think possible. Candidates with a couple 
of years of experience will be considered and 
large-scale systems experience is highly desirable.
Successful candidates will have both
strong analytical and organizational skills; exceptional
programming skills; a true love of
building quality software; strong numerical
programming skills; strong knowledge of
computational numerical algorithms, linear
algebra, and statistical methods; experience
working with large data sets; and a team spirit.
You will be joining an energetic group of 20, 
including a majority of PhDs from top uni- 
versities. We want to hear from you if you are 
ambitious and would relish the challenge, the 
opportunity, and the compensation offered. 


ACM POLICY ON NONDISCRIMINATORY ADVERTISING
ACM accepts recruitment advertising under the basic premise the advertising employer does not discriminate on the basis of age, color, race, religion, gender,
sexual preference, or national origin. ACM recognizes, however, that laws on such matters vary from country to country and contain exceptions, inconsistencies,
or contradictions. This is true of laws in the United States of America as it is of other countries.
Thus ACM policy requires each advertising employer to
state explicitly in the advertisement any employer restrictions that may apply with respect to age, color, race, religion, gender, sexual preference, or national
origin. (Observance of the legal retirement age in the employer's country is not considered discriminatory under this policy.) ACM also reserves the right to
unilaterally reject any advertising.
ACM provides notices of positions available as a service to the entire membership. ACM recognizes that from time to time
there may be some recruitment advertising that may be applicable to a small subset of the membership, and that this advertising may be inherently
discriminatory. ACM does not necessarily endorse this advertising, but recognizes the membership has a right to be informed of such career opportunities.


Polytechnic University
Computer and Information Science
Department
Polytechnic University Computer and Information
Science Department Professor of Computer
Science in Cyber Security: Polytechnic
University invites applications from candidates who will take
on a leadership role in the area of cyber secu- 
rity. The candidate is expected to obtain re- 
search and education grants, develop industry 
relationships, and be active in publishing. The 
position requires a strong demonstrated re- 
search capability with the ability to teach ad- 
vanced topics. Both senior and junior profes- 
sors will be considered for this position. The 
individual will join the research group on cy- 
ber security which has been very successful in 
obtaining research and education grants. The 
cyber security initiatives at Polytechnic University
have grown to over $10M in
the last few years. The school is an NSA Center
of Excellence in Information Assurance Education
and has received two rounds of funding
in the Scholarship for Service (SFS)
program. Currently 20 students at the BS and
MS level are in this program in addition to an 
extensive number of additional students tak- 
ing security courses. Over a dozen security 
courses are offered regularly and an on-line 
graduate level cyber security certificate pro- 
gram is also available. A state of the art labo- 
ratory in Information Systems and Internet Se- 
curity has been developed, and used by both 
undergraduate and graduate students for re- 
search and education. Current research focus 
of the program at the MS and PhD level is on 
trusted hardware, trusted software systems, 
digital forensics, multimedia security, biometrics,
application security, network security,
etc. The Computer and Information Science
(CIS) Department of Polytechnic
University has a strong faculty with a vibrant
research program and strong course offerings
in a wide area of computing. Polytechnic University
is an equal opportunity/affirmative action,
equal access employer and especially encourages
applications from minorities, women
& individuals w/disabilities. Please submit a 
CV, Research Statement and the names of 
three references to: Professor Stuart Steele, Cyber
Search Committee, Polytechnic University,
Six Metro Tech Center, Brooklyn, NY 11201.
You may also send the information electronically
to securitysearch@poly.edu


Singapore Management 
University 
Assistant/Associate Professor/Professor 
The School of Information Systems (SIS) at the 
Singapore Management University invites ap- 
plications from faculty candidates at all levels 
with research and teaching interests in the following
areas of specialty: Data management
and business intelligence, Information security
and trust, Business software and architecture,
E-commerce and supply chain systems,
and IS resource management.

For a complete job description and application
information, please view our faculty appointments
page and faculty recruitment
brochure. Please visit our website at
www.sis.smu.edu.sg.


The CREATE-NET Research Institute announces openings at all levels for

post-doc researchers
senior researchers
project managers
research directors

in the areas of

pervasive computing/communications; wireless, optical and broadband
communications/networking; multimedia; security; interaction and smart
environments; ICT for communities and business; nanotechnologies; satellite-based
networking and bio-engineering.

CREATE-NET is a dynamic European research institute, with a special mission ranging
from top quality research, to technology transfer, leading to new products, services and
start-ups. CREATE-NET is engaged in technical collaboration with major industry
partners, and coordinates European funded projects in the above areas. Being at the core
of a research network of over 250 partner institutions in Europe, with funded
collaboration worldwide at leading institutions including US, China and Israel,
CREATE-NET offers qualified candidates a unique international working environment.

CREATE-NET is located in Trento, an area envisioned as Italy’s Silicon Valley,
featuring a top-ranked university and a number of high technology research centers,
including ITC-irst and Microsoft’s newest European research center on bioinformatics.
Trento is located in northern Italy, in the beautiful Dolomites area famous for its history,
culture, cuisine, lakes, mountains, skiing and other recreational activities and among top
ranked cities in quality of life.

Interested candidates are invited to send a copy of their CV in English, a statement of
their objectives and a list of three references to careers@create-net.org.


Career Opportunities


Tel-Hai Academic College 
Lecturer 
A full time faculty position in Computer Sci- 
ence to begin October 2006. We seek appli- 
cants to teach undergraduate courses in com- 
puter science at all levels, tutor student 
projects, and to conduct research, 


The United States Military 
Academy (USMA) 
Assistant Professor Computer Science 
Dept 
The United States Military Academy (USMA) 
seeks applicants for the position of assistant 
professor of computer science in the Depart- 
ment of Electrical Engineering and Computer 
Science. Junior associate professors will also be 
considered. Applicants must hold a Ph.D. and 
have a record of scholarship in Computer Sci- 
ence or a related discipline. Teaching experi- 
ence is desired, particularly in operating sys- 
tems and related courses. Practical background 
in the design and operation of computer sys- 
tems is beneficial. USMA, the oldest engi- 
neering school in the nation, is situated 50 
miles north of New York City in the pic- 
turesque Hudson Valley. We offer a Tier I un- 
dergraduate program devoted to educating 
leaders of character committed to the values 
of Duty, Honor, and Country. USMA’s rigor- 
ous programs, quality faculty, and student-to- 
faculty ratio of 8:1 attract a student body of 
4,000 of the nation's best. The Department of 
Electrical Engineering and Computer Science 
includes a team of 52 civilian and military fac- 
ulty members dedicated foremost to teaching 
and student development. Faculty and stu- 
dents have excellent opportunities for research 
supported by a ubiquitous campus computing 
environment and laboratory facilities among 
the best in the nation. The department offers 
ABET accredited majors in Electrical Engi- 
neering and Computer Science, an Information 
Technology major currently seeking accredi- 
tation, two minors, and two core courses in 
IT. Between 40 and 70 majors graduate from 
our programs each year. We are looking for 
well-rounded professionals capable of design- 
ing, teaching, and overseeing a variety of 
courses and programs, conducting research, 
serving as mentors to junior faculty members,
supporting students in and outside the classroom,
and contributing to department and
academy governance. USMA offers renewable 
contracts in lieu of tenure. Salaries are com- 
petitive. Applications will be accepted until 
the position is filled. Applications and ques- 
tions may be directed to COL Eugene K. 
Ressler, United States Military Academy, De- 
partment of Electrical Engineering and Com- 
puter Science, MADN-EECS, West Point, 
New York, 10996-1787, (845) 938-5582, 
ressler@usma.edu. The United States Military 
Academy is an Equal Opportunity, Affirma- 
tive Action Employer. Women and Minorities 
are encouraged to apply. 


University of Chicago 
Department of Computer Science 
The Department of Computer Science at 
the University of Chicago and the 
Mathematics and Computer Science (MCS) 
Division at Argonne National Laboratory 
are recruiting for a joint position. 

The University appointment may be at any 
faculty rank and is tenure track; the appointment
in MCS will have a similar rank. We are
particularly interested in candidates in distributed
computing; large-scale data systems,
file systems, and data mining; systems soft- 
ware for large-scale and unconventional sys- 
tems; and architectures, languages, and com- 
pilers for high-performance computing. 

The University of Chicago has the highest 
standards for scholarship and faculty quality, 
and encourages collaboration across disciplines. 
Argonne, the first U.S. national laboratory, op- 
erates world-class research programs and facil- 
ities across the physical and biological sciences. 
MCS has particular research strengths in com- 
putational science, parallel computing, dis- 
tributed computing, and applied mathematics. 
The university's Department of Computer Sci- 
ence currently has two faculty whose primary 
appointments are at Argonne, and this ap- 
pointment is intended to further strengthen 
the connection between these institutions. The 
successful appointee may also join the Com- 
putation Institute, a joint Argonne-Chicago 
enterprise created to address the most chal- 
lenging problems arising in the use of strate- 
gic computation and communications across a 
broad spectrum of intellectual activities. 


VCU 


Virginia Commonwealth University 


PROFESSOR AND CHAIR 
OF COMPUTER SCIENCE 


Virginia Commonwealth University invites applications for the position of Professor and Chair of 
Computer Science. The Computer Science Program has offered baccalaureate, certificate, and master's 
degrees for over 20 years. It was the first in the state to become accredited by ABET in 1988. In Fall 
2001, the program became part of the School of Engineering. At that time, the School of Engineering 
initiated a Ph.D. program in Engineering. Computer Science, one of six programs of study offered by 
the VCU School of Engineering, currently has nine faculty members with research interests in the areas 
of Software Engineering, Networking, Software Testing, Medical Applications, Database, Neural 
Networks, Parallel Programming and Programming Languages. The Computer Science Program has 
strong ties to the Bioinformatics Program in Life Sciences and an excellent working relationship with 
both Information Systems and Computer Engineering. The Chair manages departmental expenditures, 
and supervises assessment and improvement of the program to maintain ABET accreditation. 


Candidates for this position must be eligible for employment in the United States and indicate their
citizenship or visa status. A Ph.D. in Computer Science or related field is required. Candidates for 
this position must display a strong record of research in computer science that can support the 
teaching and research missions of the Computer Science Program. The faculty are committed to 
maintaining a standard of excellence in undergraduate teaching while expanding research activities 
in conjunction with the newly instituted Ph.D. program. Information on the School of Engineering 
is available at http://www.egr.vcu.edu. 


Evaluation of applications will continue until successful candidates are selected. Applicants should 
send a statement of their teaching and research interests, curriculum vitae and contact information 
for at least four references to: Dr. Susan Brilliant, Search Committee Chair, Computer 
Science, P.O. Box 843068, Richmond, VA 23284-3068. 


VCU is a culturally diverse, urban university that benefits from a rich variety of cultural 
opportunities. Richmond is centrally located two hours from the mountains, the beach and 
Washington, D.C. 


Virginia Commonwealth University is an equal opportunity/affirmative action employer. 
Women, minorities, and persons with disabilities are encouraged to apply.


The Chicago metropolitan area provides a
diverse and exciting environment. The local 
economy is vigorous, with international stature 
in banking, trade, commerce, manufacturing, 
and transportation, and the cultural scene includes
diverse cultures, vibrant theater, a
world-renowned symphony, and excellent
opera, jazz, and blues. The University is located
in Hyde Park, a pleasant Chicago neighborhood
on the Lake Michigan shore. Argonne is
located in the southwestern suburbs of 
Chicago, with easy access to all the city's cul- 
tural and educational benefits. The two insti- 
tutions are linked by a frequent free shuttle. 
Please send nominations or applications to: 

Professor David B. MacQueen, Chairman
Department of Computer Science
The University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637-1581


or to apply-072033@cs.uchicago.edu (attach- 
ments can be in pdf, postscript, or Word). Please 
quote ref. # 072033 Argonne/CS joint apt. 
Complete applications consist of (a) a curriculum
vitae, including a list of publications,
(b) three letters of reference sent to
recommend-072033@cs.uchicago.edu (including
one that addresses teaching ability), and (c) a
research and teaching statement that discusses 
both past research and future plans. Applicants 
must have completed, or will soon complete, 
a doctorate degree. Applications must arrive 
by April 30, 2006. The University of Chicago 
and Argonne National Laboratory are equal 
opportunity/affirmative action employers. 


University of Houston 
Department of Computer Science 
University of Houston, Department of Computer
Science invites applications for multiple
tenure-track faculty positions starting in Au- 
gust 2006. A wide range of research interests 
will be considered with an emphasis on com- 
puter graphics, databases, software engineer- 
ing, human-computer interaction, robotics, 
systems and theory. Preference will be given 
to candidates at the Assistant Professor level
but exceptional candidates at all levels will get 
full consideration. Candidates should hold a 
Ph.D. in Computer Science or a closely related 
field. For full details see http://www.cs.uh.edu/positions/faculty.shtml.
UH is an EO/AA employer.


Advertising in Career Opportunities

How to Submit a Classified Line Ad: Send an e-mail to
acm-advertising@acm.org. Please include text, and indicate the issue or
issues where the ad will appear, and a contact name and number.

Estimates: An insertion order will then be e-mailed back to you. The ad
will be typeset according to CACM guidelines. NO PROOFS can be sent.
Classified line ads are NOT commissionable.

Rates: $240.00 for six lines of text, 40 characters per line. $80.00 for
each additional three lines. The MINIMUM is six lines.

Deadlines: The closing date is on the 21st of the month, five weeks prior
to the publication date of the issue (which is the first of every month).

Career Opportunities Online: Classified and recruitment display ads
receive a free duplicate listing on our website at
http://www.acm.org/cacm/careeropps/. Ads are listed for a period of six weeks.

For More Information Contact:
WILLIAM KOONEY
Account Manager
at 212-626-0687 or kooney@hq.acm.org


University of Illinois at Chicago 
Peter and Deborah Wexler Chair in 
Information Technology 
The College of Engineering at the University 
of Illinois at Chicago is pleased to invite ap- 
plications and nominations for the Peter and 
Deborah Wexler Chair in Information Tech- 
nology. The Wexler Chair is the first endowed 
chair in the UIC College of Engineering and 
was made possible by a $2 million gift from 
alumnus Peter Wexler. The University of Illi- 
nois at Chicago is one of Illinois’ two public- 
assisted comprehensive research universities, 
and the largest university in the city of 
Chicago. UIC has about 25,000 students 
(16,000 undergraduates, 6,800 graduate stu- 
dents, and 2,200 professional students) with a 
budget of $1.3 billion and research expenditures
exceeding $250 million.

The College of Engineering seeks a world- 
renowned researcher with distinguished uni- 
versity and/or industrial credentials and expe- 
rience in information technology, who may 
have joint appointments in the Department of 
Electrical and Computer Engineering and the 
Department of Computer Science. This is a 
tenured faculty position at a senior rank 
(Ph.D.) beginning either August 16, 2006 or 
January 1, 2007. The College of Engineering 
is seeking an individual who, in addition to 
bringing increased visibility to UIC, will lead 
various large, collaborative research projects 
involving many college faculty members and 
who will create and direct a large research cen- 
ter in the area of information technology. 

The UIC College of Engineering has 115 
faculty members and plans to increase that 
number to 130; the college enrolls 1,550 un- 
dergraduate and 850 graduate students and 
provides an exceptional engineering education 
in the heart of the world-class city of Chicago. 
The college is composed of six academic de- 
partments: Bioengineering, Chemical Engi- 
neering, Civil and Materials Engineering, 
Computer Science, Electrical and Computer 
Engineering, and Mechanical and Industrial 
Engineering. More than 40 of our faculty 
members hold the rank of Fellow in their pro- 
fessional societies. Twenty are recipients of Na- 
tional Science Foundation CAREER or Young 
Investigator awards, and two are members of 
the National Academy of Engineering. The 
college’s research expenditures totaled over $21 
million in the 2004-2005 academic year. 

The Department of Computer Science has 
181 B.S. students, 101 Ph.D. students, and 
151 M.S. students. Its 29 faculty members, 
eight of whom are recipients of CAREER 
awards and five of whom are Fellows of professional
societies, are engaged in research in
the areas of artificial intelligence, bioinformat- 
ics, databases, multimedia systems, real-time 
systems, security, software engineering, theory, 
virtual reality, and visualization. The depart- 
ment’s research expenditures in the 2004-2005 
academic year totaled $6.8 million. 

The Department of Electrical and Computer 
Engineering has 514 B.S. students, 99 Ph.D. 
students, and 77 M.S. students. Its 30 faculty
members, six of whom are recipients of CA- 
REER or Young Investigator awards and ten 
of whom are Fellows of IEEE, are engaged in 
research in the areas of bioelectronics, device 
physics and electronics, information systems, 
and computer engineering. The department’s 
research expenditures in the 2004-2005 acad- 
emic year totaled $3.2 million. 

To apply for this position, please send a let- 
ter of application, a resume, contact informa- 
tion for three to five professional references, 
and a statement of research and teaching in- 
terests to: 

wexlersearch@uic.edu
or
Professors Ouri Wolfson and Gyungho Lee
Co-Chairs, Wexler Chair Search Committee
Attn: Nancy Singer
College of Engineering (MC 159)
University of Illinois at Chicago
851 South Morgan Street, 8th Floor
Chicago, IL 60607


For fullest consideration, applications 
should be submitted by May 1, 2006. The 
University of Illinois at Chicago is an Affir- 
mative Action/Equal Opportunity Employer 
and particularly encourages applications from 
women and members of under-represented 
minority groups. For additional information 
on the college, please go to
http://www.uic.edu/depts/enga/.


University of Kentucky 
Assistant Professor 
The University of Kentucky Computer Science 
Department invites applications for a tenure- 
track position beginning August 15, 2006 at 
the assistant professor level. Candidates should 
have a PhD in Computer Science. Review of 
credentials will begin on March 1, 2006, and 
the search process will continue until a suitably 
qualified candidate is found. We are especially 
interested in candidates with expertise in data- 
bases, preferably specializing in data mining, 
very large databases, networked databases, 
XML databases, image databases, or related 
topics. A successful candidate must be able to 
teach both undergraduate and graduate classes 
in databases and will be expected to conduct 
innovative research. Potential candidates must 
apply online at http://www.uky.edu/UKjobs/. 
Click 'Online Employment for Job Seekers',
then 'Search Postings', then enter SL511481
under 'Requisition Number'. Application
deadline is March 17, 2006 but may be extended
if necessary. For any questions related
to the application process, please contact:
HR/Employment, 112 Scovell Hall, Lexington,
KY 40506-0046, phone 859-257-9555, press
"2", or by email at ukjobs@email.uky.edu. The
University of Kentucky is an equal-opportunity
employer and especially encourages applications
from women and minority candidates.


Professorship (W3)
Information Services
University of Stuttgart, Germany

Applicants are sought for a full professorship in the Department of Computer Science, Electrical
Engineering and Information Technology of the University of Stuttgart. The appointed professor will also
be head of the Computing Center of the University of Stuttgart. The research and teaching expertise of
applicants should encompass at least two of the following areas: large-scale web-based applications,
middleware components, multi-media information management (e.g., media storage, digital libraries),
multi-media communication services, security, network and system management. The participation of the
appointed professor in the university's centers of excellence and other research initiatives is highly
desired. Also appropriate management experience is required to run the Computing Center. For more
information see:
http://www.f-iei.uni-stuttgart.de/aktuell/W3InformationServicesEn.html

The University of Stuttgart wishes to increase the proportion of female academic staff and, for this
reason, especially welcomes applications from women. Severely challenged persons will be given
preference in case of equal qualifications.


University of Nevada, Las Vegas 
Assistant/Assoc. Professor School of 
Informatics 
Assistant/Associate Professor. School of Infor- 
matics, UNLV. Commencing Fall 2006. 
https://hr.unlv.edu/Jobs. EOE/AA Employer


University of Waterloo 
Department of Electrical and Computer 
Engineering 
University of Waterloo: The Department of 
Electrical and Computer Engineering invites 
applications for faculty positions in most areas 
of computer engineering, software engineer- 
ing, and nanotechnology engineering, and in 
VLSI/circuits, information security, photonics, 
MEMS, control/mechatronics, 
processing, and quantum computing. Please 
visit https://eceadmin.uwaterloo.ca/DACA for 

more information and online application. 
PAUL WATSON 


Inside Risks | Lauren Weinstein 


It was only a matter of time. We've come to expect 
almost anything imaginable to be sold on late-night 
TV infomercials—from feel-good “health” bracelets 
to “get rich quick” real-estate schemes. So I shouldn't 
have been too surprised to stumble across a 3 A.M. 
full-hour ad for a firm offering biometric “appliances” 
(for legal applications only—the superimposed fine 
print notes—not responsible for customer misuse!). 

I couldn't help but be reminded of an old “Twilight 
Zone” episode in which motherless siblings choose 
their desired component parts for an android 
“mother”—this color eyes, that color hair, the perfect 
musical voice. That's what this grotesque infomercial 
was selling: artificial fingers, false eyeballs, voice simula- 
tors, full-face latex disguises. And much more. 

While the energetic announcer never came right out 
and said so, it was clear from the outset that the only 
possible real purpose for these devices was to defeat bio- 
metric identification systems. Admittedly, many such 
systems perform so poorly that it doesn't take rocket science 
to fool them, but the folks behind the commercial 
still made a very convincing pitch. They also appeared 
to be well versed on the current state of both biometric 
and associated spoofing technologies. The ad even 
made fun of the now infamous homemade “gummy 
bear” false fingerprint technique, noting how this firm’s 
professional, custom-made fingers (with exchangeable 
fingerprint tips) were automatically maintained at per- 
fect body temperature to fool most scanners (a pair of 
AAA batteries for heating the digit not included). 

The false eyeballs for sale are apparently especially versa- 
tile, as it was shown how they can be held in the hand for 
either retinal or iris scanning units, or even worn (as a sort 
of half-shell) over the user's own eye(s) for scanners that are 
pickier about such configurations (the latter a scenario 
straight out of James Bond movies). The voice simulator 
device being promoted—obviously for defeating voice 
ID systems—was perhaps the most mundane device in 
the lot, but was still a pretty slick all-digital affair, 
designed to be both utilitarian and utterly inconspicuous. 

At one point, it was suggested that a possible applica- 
tion for their products was for bad-spirited practical 
jokes. The ad’s simulated demonstrations included a 
man who discovered that his fingerprints and iris pat- 
terns had been usurped and discredited by an interloper 
using those products. Realizing that he had no way to 
change his own biometrics, he chopped off his index finger 
with a Ginsu knife and stuck a shish kebab skewer 
into his eye in frustration. Really funny stuff, huh? 

But it got even worse. In addition to their physical 
appliances product line, the same fine Bahamas-based 
firm offered what could only be described as a corre- 
spondence course in database hacking. They sug- 
gested that another type of fun could be yours by 
corrupting and manipulating biometric databases on 
either a small or large scale, all from the comfort of 
your home computer while sitting in bed. No false 
fingers or fake eyeballs required for this approach. 
Now that’s entertainment (broadband Internet con- 
nection and Windows XP required). 

By the end of this potpourri of identity horrors, my 
head was spinning. The commercial explained how to 
gather fingerprints to send in for finger fabrication, 
and noted how you could order a little infrared gadget 
that would collect iris and retinal data—even at a con- 
siderable distance without the knowledge of your tar- 
get—for completing that aspect of your order. They 
even suggested that prospective customers “ask the 
operator” at the 800 number about “odor index” and 
“DNA fabrication services” that were also available. 

It was with some relief that this nightmarish appari- 
tion of an ad finally ended, with a splash of garish syn- 
thesizer music and the usual promises of a money-back 
guarantee if not completely satisfied. If nothing else, I 
was again reminded of the risks of late-night burritos 
keeping me awake, and the mental hazards of tuning 
in satellite TV stations above channel 999. 

As I finally tried to doze off, visions of dismem- 
bered fingers and spinning eyeballs still invaded my 
thoughts, and I couldn’t help but muse on the implications 
of that 60-minute descent into the identification 
inferno that I had just endured. 

For all of the promotional hype surrounding bio- 
metric ID systems, it’s probable that they're actually 
setting the stage for a whole vast new range of abuses 
and problems, which will make us long for the “good 
old days” of passwords that could be easily changed 
when compromised. Perhaps the nightmare was actually 
only just beginning, after all. 


LAUREN WEINSTEIN (lauren@pfir.org) is co-founder of People for 
Internet Responsibility (www.pfir.org). He moderates the Privacy Forum 
(www.vortex.com/privacy). 


Editor’s note: This April Fool’s edition of the “Inside Risks” column is 
intended for amusement purposes only. 
It’s an exciting, fast-moving world, and Computing Reviews is your tool for 
Shaping the future of the World Wide Web 

The Web continues to transform the way we communicate, learn, 
trade, and govern. Edinburgh, Scotland, is the place to be in May 
for four packed days of leading speakers, workshops, exhibitions, 
panels, discussion and debate on the new web technologies and 
capabilities that are revolutionizing all our lives. WWW2006 will 
bring together business leaders, industrial technologists, academics, 
and standards bodies. 

WWW2006 Platinum Sponsors: MSN 
WWW2006 Gold Sponsors include Mercedes-Benz 
its value for social change then this is an event you should attend. www2006.org 
Also sponsoring WWW2006: European Patent Office 

Conference Partners: School of Electronics and Computer Science (ECS), University of Southampton, United Kingdom; 
British Computer Society (BCS), United Kingdom; Association for Computing Machinery (ACM), United States; World Wide Web Consortium (W3C); IFIP 


